I'm creating an app in React Native with a Node.js backend. The app is almost done, and I'm now stress testing the backend.
In my PostgreSQL database, I have a table called notifications to store all the notifications a user receives.
In my app, a user can follow pages. When a page posts a new message, I want to send a notification to all users following that page. Every user should receive an individual notification.
Let's say a page is followed by 1 million users, and the page posts a new message: this means 1 million notifications (i.e. 1 million rows) should be inserted into the database.
My solution (for now) is to chunk up the array of user IDs (of the users following the page) into chunks of 1000 user IDs each, and do an insert query for every chunk.
const db = require('./db');
const format = require('pg-format');

const userIds = [1, 2, 3, 4, 5, ..., 1000000];

// split up the user IDs array into chunks of 1000 user IDs each
// (chunkArray is a function that splits an array into multiple arrays of x items, here x = 1000)
const chunks = chunkArray(userIds, 1000);

// loop over each chunk (this runs inside an async function)
for (const chunk of chunks) {
  // build an array of 1000 value tuples: [user ID, notification type, page ID]
  const rows = chunk.map(userId => [userId, 'post', _PAGE_ID_]);

  // create and run the multi-row insert for this chunk
  const query = format('INSERT INTO notifications (userId, type, pageId) VALUES %L', rows);
  await db.query(query);
}
I'm using node-postgres for the database connection, and I'm creating a connection pool. I fetch one client from the pool, so only one connection is used for all the queries in the loop.
This all works, but inserting 1 million rows takes a few minutes. I'm not sure this is the right way to do this.
Another solution I came up with is using "general notifications". When a page updates a post, I only insert one notification into the notifications table, and when I query for all notifications for a specific user, I check which pages the user is following and fetch all the general notifications of those pages with the query. Would this be a better solution? It would leave me with a lot fewer notification rows, and I think it would improve performance.
Thank you for all responses!
I'm trying to implement my other solution. When a page updates a post, I insert only one notification without a user ID (because it has no specific destination), but with the page ID.
When I fetch all the notifications for a user, I first check for all the notifications with that user ID, and then for all notifications without a user ID but with a page ID of a page that the user is following.
I think this is not the easiest solution, but it will reduce the number of rows and if I do a good job with indexes and stuff, I think I'm able to write a pretty performant query.
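Roughly, the fetch I have in mind would look something like this with node-postgres (the follows table name here is just a placeholder for my followers table):

// fetch personal notifications plus "general" page notifications for one user
// "follows" is a placeholder name for the table that stores which user follows which page
const { rows } = await db.query(
  `SELECT n.*
   FROM notifications n
   WHERE n.userId = $1
      OR (n.userId IS NULL AND n.pageId IN (
            SELECT f.pageId FROM follows f WHERE f.userId = $1
         ))`,
  [userId]
);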
Without getting into which solution would be better, one way to solve it could be this, provided that you keep all the pages and followers in the same database.
INSERT INTO notifications (userId, type, pageId)
SELECT users.id, 'post', pages.id
FROM pages
JOIN followers ON followers.pageId = pages.id
JOIN users ON followers.userId = users.id
WHERE pages.id = _PAGE_ID_
This would allow the DB to handle everything, which should speed up the insert since you don't need to send each individual row from the server.
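If it helps, here is a rough sketch of running that from Node with node-postgres and a parameterized query (assuming the same pages/followers/users tables as above):

const { Pool } = require('pg');
const pool = new Pool();

// fan out one page post into individual notification rows, entirely inside the DB
async function fanOutNotifications(pageId) {
  await pool.query(
    `INSERT INTO notifications (userId, type, pageId)
     SELECT users.id, 'post', pages.id
     FROM pages
     JOIN followers ON followers.pageId = pages.id
     JOIN users ON followers.userId = users.id
     WHERE pages.id = $1`,
    [pageId]
  );
}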
If you don't have the users/pages in the same DB then it's a bit more tricky.
You could prepare a CSV file, upload it to the database server and use the COPY command. If you don't have access to the server, you might be able to stream the data directly as the COPY command can read from stdin (that depends on the library, I'm not familiar with node-postgres so I can't tell if it's possible.)
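For what it's worth, a rough, unverified sketch of streaming into COPY from Node, assuming the third-party pg-copy-streams package is available:

const { Pool } = require('pg');
const { from: copyFrom } = require('pg-copy-streams'); // assumption: pg-copy-streams is installed
const { Readable } = require('stream');
const { pipeline } = require('stream/promises');

const pool = new Pool();

async function copyNotifications(userIds, pageId) {
  const client = await pool.connect();
  try {
    // COPY ... FROM STDIN returns a writable stream we can pipe CSV rows into
    const ingest = client.query(copyFrom('COPY notifications (userId, type, pageId) FROM STDIN WITH (FORMAT csv)'));
    const csvRows = Readable.from(userIds.map(id => `${id},post,${pageId}\n`));
    await pipeline(csvRows, ingest);
  } finally {
    client.release();
  }
}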
Alternatively, you can do everything in a transaction by issuing a BEGIN before you do the inserts. This is the slowest alternative, but it saves the overhead of Postgres creating an implicit transaction for each statement. Just don't forget to COMMIT afterwards. The library might even have its own way to create explicit transactions and insert data through it.
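A minimal sketch of that explicit-transaction variant with node-postgres, reusing the pool, chunks, format and _PAGE_ID_ from your question:

const client = await pool.connect();
try {
  await client.query('BEGIN');
  for (const chunk of chunks) {
    const rows = chunk.map(userId => [userId, 'post', _PAGE_ID_]);
    await client.query(format('INSERT INTO notifications (userId, type, pageId) VALUES %L', rows));
  }
  await client.query('COMMIT'); // don't forget this part
} catch (err) {
  await client.query('ROLLBACK');
  throw err;
} finally {
  client.release();
}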
That said, I would probably do a variation of your second solution since it would create fewer rows in the DB, but that depends on your other requirements; it might not be possible if you need to track notifications or perform other actions on them.
Use async.eachOfLimit to insert X chunks in parallel. In the following example you will insert 10 chunks in parallel:
const userIds = [1,2, 3, 4, 5, ..., 1000000];
const chunks = chunkArray(userIds, 1000);
var BATCH_SIZE_X = 10;
async.eachOfLimit(chunks, BATCH_SIZE_X, async (chunk) => {
  // build the (userId, type, pageId) value tuples for this chunk
  const rows = chunk.map(userId => [userId, 'post', _PAGE_ID_]);
  const query = format("INSERT INTO notifications (userId, type, pageId) VALUES %L", rows);
  await db.query(query);
}, function(err) {
  if (err) {
    // at least one chunk failed to insert
  } else {
    // all chunks have been inserted
  }
});
Related
I use MongoDB to store user data. The user ID increases incrementally, such as 1, 2, 3, 4, etc., when new users register.
I have the following code to generate the user ID. "users" is the name of the collection where I store the user data.
// generate new user id
let uid;
const collections = await db.listCollections().toArray();
const collectionNames = collections.map(collection => collection.name);
if (collectionNames.indexOf("users") == -1) {
  uid = 1;
} else {
  const newest_user = await db.collection("users").find({}).sort({ "_id": -1 }).limit(1).toArray();
  uid = newest_user[0]["_id"] + 1;
}
user._id = uid;

// add and save user
db.collection("users").insertOne(user).catch((error) => {
  throw error;
});
One concern I have now is that when two users make a request to register at the same time, they will both read the same maximum user ID and create the same new user ID. One way to prevent this would be a lock, but I don't think Node.js and Next.js support multi-threading.
What are some alternatives I have to solve this problem?
In addition, _id will be the field for the uid. Will it make a difference, since _id can't be duplicated?
Why not have the database generate the auto-incrementing ID? https://www.mongodb.com/basics/mongodb-auto-increment
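A rough sketch of the counter pattern described on that page, using an atomic findOneAndUpdate on a separate counters collection (the collection name is illustrative, and the exact return shape differs between driver versions):

// atomically increment a counter document and use the result as the next user id
async function getNextUserId(db) {
  const result = await db.collection('counters').findOneAndUpdate(
    { _id: 'users' },
    { $inc: { seq: 1 } },
    { upsert: true, returnDocument: 'after' }
  );
  // older drivers return { value: doc }, newer ones return the document directly
  const doc = result.value ?? result;
  return doc.seq;
}

// registration then becomes:
user._id = await getNextUserId(db);
await db.collection('users').insertOne(user);

Because the $inc happens atomically on the server, two concurrent registrations can never receive the same ID.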
One idea I have is using a transaction, which can solve the concurrency issue. Transactions obey the rules of ACID, so the writes to the database from the concurrent requests will run in isolation.
I have written a function that gets a QuerySnapshot of all documents changed in the past 24 hours in Firestore. I loop through this QuerySnapshot to get the relevant information. I want to save the information from these docs into maps that are unique for every user. Every user generates on average 10 documents a day, so every map gets written 10 times on average. Now I'm wondering if the whole thing is scalable, or whether it will hit the 500-writes-per-transaction limit in Firebase as more users use the app.
The limitation I'm speaking about is documented in the Google documentation.
Furthermore, I'm pretty sure that my code is really slow, so I'm thankful for every optimization.
exports.setAnalyseData = functions.pubsub
  .schedule('every 24 hours')
  .onRun(async (context) => {
    const date = new Date().toISOString();
    const convertedDate = date.split('T');

    // Get documents (that could be way more than 500)
    const querySnapshot = await admin.firestore().collectionGroup('exercises').where('lastModified', '>=', `${convertedDate}`).get();

    // iterate through documents
    querySnapshot.forEach(async (doc) => {
      // some calculations

      // get document to store the calculated data
      const oldRefPath = doc.ref.path.split('/trainings/');
      const newRefPath = `${oldRefPath[0]}/exercises/`;
      const document = await getDocumentSnapshotToSave(newRefPath, doc.data().exercise);
      document.forEach(async (doc) => {
        // check if value exists
        const getDocument = await admin.firestore().doc(`${doc.ref.path}`).collection('AnalyseData').doc(`${year}`).get();
        if (getDocument && getDocument.exists) {
          await document.update({
            // map filled with data which gets added to the existing map
          })
        } else {
          await document.set({
            // set document if it is not existing
          }, {
            merge: true
          });
          await document.update({
            // update document after set
          })
        }
      })
    })
  })
The code you have in your question does not use a transaction on Firestore, so is not tied to the limit you quote/link.
I'd still recommend putting a limit on your query though, and processing the documents in reasonable batches (a couple of hundred is a reasonable size) so that you don't put an unpredictable memory load on your code.
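As a rough sketch (the batch size, the orderBy and the helper name are illustrative, not from your code), you could page through the collection-group query instead of loading everything at once:

const admin = require('firebase-admin');

async function processInBatches(lastModified) {
  const baseQuery = admin.firestore()
    .collectionGroup('exercises')
    .where('lastModified', '>=', lastModified)
    .orderBy('lastModified')
    .limit(200); // a couple of hundred documents per batch

  let lastDoc = null;
  for (;;) {
    const snap = await (lastDoc ? baseQuery.startAfter(lastDoc) : baseQuery).get();
    if (snap.empty) break;
    for (const doc of snap.docs) {
      // ...do the per-document calculations and writes here...
    }
    lastDoc = snap.docs[snap.docs.length - 1];
  }
}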
I was attempting to fetch all documents from a collection in a Node.js environment. The documentation advises the following:
import * as admin from "firebase-admin";
const db = admin.firestore();
const citiesRef = db.collection('cities');
const snapshot = await citiesRef.get();
console.log(snapshot.size);
snapshot.forEach(doc => {
console.log(doc.id, '=>', doc.data());
});
I have 20 documents in the 'cities' collection. However, the logging statement for the snapshot size comes back as 0.
Why is that?
Edit: I can write to the Firestore without issue. I can also get details of a single document, for example:
const city = await citiesRef.doc("city-name").get();
console.log(city.id);
will log city-name to the console.
Ensure that Firebase has been initialized, and verify that the collection name matches your database exactly; hidden spaces and letter case can break the link to Firestore. One way to test this is to create a new document within the collection to validate the path.
db.collection('cities').doc("TEST").set({test:"value"}).catch(err => console.log(err));
This should result in a document in the correct path, and you can also catch it to see if there are any issues with Security Rules.
Update
To list all documents in a collection, you can do this with the Admin SDK through a server environment such as Cloud Functions using the listDocuments() method, but this does not reduce the number of reads.
const documentReferences = await admin.firestore()
.collection('someCollection')
.listDocuments()
const documentIds = documentReferences.map(it => it.id)
To reduce reads, you will want to aggregate the data in the parent document or in a dedicated collection. This would double the writes for any update, but cut the read count down to a minimum.
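A rough sketch of that aggregation idea (collection and field names are illustrative): every write to cities also mirrors the document ID into a single summary document, so listing all cities later costs one read.

const admin = require('firebase-admin');
const db = admin.firestore();

async function addCity(id, data) {
  const batch = db.batch();
  batch.set(db.collection('cities').doc(id), data);
  // mirror the id into one aggregate document
  batch.set(
    db.collection('aggregates').doc('cities'),
    { ids: admin.firestore.FieldValue.arrayUnion(id) },
    { merge: true }
  );
  await batch.commit();
}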
Shortly, imagine I have a Cloud Firestore DB where I store some users data such as email, geo-location data (as geopoint) and some other things.
In Cloud Functions I have "myFunc" that runs trying to "link" two users between them based on a geo-query (I use GeoFirestore for it).
Now everything works well, but I cannot figure out how to avoid this kind of situation:
User A calls myFunc trying to find a person to be associated with, and finds User B as a possible one.
At the same time, User B calls myFunc too, trying to find a person to be associated with, BUT finds User C as a possible one.
In this case User A would be associated with User B, but User B would be associated with User C.
I already have a field called "associated" set to FALSE on each user initialization, that becomes TRUE whenever a new possible association has been found.
But this code cannot guarantee the right association if User A and User B trigger the function at the same time, because at the moment the function triggered by User A finds User B, the "associated" field of B will still be set to false, because B is still searching and has not found anybody yet.
I need to find a solution, otherwise I'll end up having wrong associations (User A pointing at User B, but User B pointing at User C).
I also thought about adding a snapshot listener to the user who is searching, so that if another user updates the searching user's document, I could terminate the function, but I'm not really sure it will work as expected.
I'd be incredibly grateful if you could help me with this problem.
Thanks a lot!
Cheers,
David
HERE IS MY CODE:
exports.myFunction = functions.region('europe-west1').https.onCall( async (data , context) => {
const userDoc = await firestore.collection('myCollection').doc(context.auth.token.email).get();
if (!userDoc.exists) {
return null;
}
const userData = userDoc.data();
if (userData.associated) { // IF THE USER HAS ALREADY BEEN ASSOCIATED
return null;
}
const latitude = userData.g.geopoint["latitude"];
const longitude = userData.g.geopoint["longitude"];
// Create a GeoQuery based on a location
const query = geocollection.near({ center: new firebase.firestore.GeoPoint(latitude, longitude), radius: userData.maxDistance });
// Get query (as Promise)
let otherUser = []; // ARRAY TO SAVE THE FIRST USER FOUND
query.get().then((value) => {
// CHECK EVERY USER DOC
value.docs.map((doc) => {
doc['data'] = doc['data']();
// IF THE USER HAS NOT BEEN ASSOCIATED YET
if (!doc['data'].associated) {
// SAVE ONLY THE FIRST USER FOUND
if (otherUser.length < 1) {
otherUser = doc['data'];
}
}
return null;
});
return value.docs;
}).catch(error => console.log("ERROR FOUND: ", error));
// HERE I HAVE TO RETURN AN .update() OF DATA ON 2 DOCUMENTS, IN ORDER TO UPDATE THE "associated" and the "userAssociated" FIELDS OF THE USER WHO WAS SEARCHING AND THE USER FOUND
return ........update({
associated: true,
userAssociated: otherUser.name
});
}); // END FUNCTION
You should use a transaction in your Cloud Function. Since Cloud Functions use the Admin SDK in the back-end, transactions in a Cloud Function use pessimistic concurrency controls.
Pessimistic transactions use database locks to prevent other operations from modifying data.
See the doc for more details. In particular, you will read that:
In the server client libraries, transactions place locks on the documents they read. A transaction's lock on a document blocks other transactions, batched writes, and non-transactional writes from changing that document. A transaction releases its document locks at commit time. It also releases its locks if it times out or fails for any reason.

When a transaction locks a document, other write operations must wait for the transaction to release its lock. Transactions acquire their locks in chronological order.
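A minimal sketch of what that could look like in your function (the field names come from your question, and the geo-matching itself is left out), where both users' associated flags are re-checked and updated inside one transaction:

async function associateUsers(userRefA, userRefB) {
  return firestore.runTransaction(async (tx) => {
    const [docA, docB] = await Promise.all([tx.get(userRefA), tx.get(userRefB)]);
    if (docA.data().associated || docB.data().associated) {
      // someone was linked by a concurrent call in the meantime: give up on this pair
      return null;
    }
    tx.update(userRefA, { associated: true, userAssociated: docB.data().name });
    tx.update(userRefB, { associated: true, userAssociated: docA.data().name });
    return docB.id;
  });
}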
I am using Firebase Cloud Functions and the Firebase Realtime Database.
My database structure is:
-users
-userid32
-userid4734
-flag=true
-userid722
-flag=false
-userid324
I want to query only the users whose field 'flag' is 'true'.
What I am doing currently is going over all the users and checking them one by one. But this is not efficient, because we have a lot of users in the database and it takes more than 10 seconds for the function to run:
const functions = require('firebase-functions');
const admin = require("firebase-admin");
admin.initializeApp(functions.config().firebase);
exports.test1 = functions.https.onRequest((request, response) => {
  // Read Users from database
  admin.database().ref('/users').once('value').then((snapshot) => {
    var values = snapshot.val(),
        current,
        numOfRelevantUsers,
        res = {}; // Result string

    numOfRelevantUsers = 0;

    // Traverse through all users to check whether the user is eligible to get discount.
    for (val in values) {
      current = values[val]; // Assign current user to avoid values[val] calls.
      // Do something with the user
    }
    ...
  });
Is there a more efficient way to make this query and get only the relevant records? (and not getting all of them and checking one by one?)
You'd use a Firebase Database query for that:
admin.database().ref('/users')
  .orderByChild('flag').equalTo(true)
  .once('value').then((snapshot) => {
    const numOfRelevantUsers = snapshot.numChildren();
When you need to loop over child nodes, don't treat the resulting snapshot as an ordinary JSON object. While that may work here, it will give unexpected results when you order on a value with an actual range. Instead, use the built-in Snapshot.forEach() method:
snapshot.forEach(function(userSnapshot) {
  console.log(userSnapshot.key, userSnapshot.val());
});
Note that all of this is fairly standard Firebase Database usage, so I recommend spending some extra time in the documentation for both the Web SDK and the Admin SDK for that.