How to write more than 500 documents using batch in cloud firestore? [duplicate] - node.js

This question already has answers here:
How can I update more than 500 docs in Firestore using Batch?
(8 answers)
Closed 2 years ago.
I am trying to update more than 500 documents (around 1,000 to 2,000), but I only know how to use a batch for up to 500 documents. How can I update more than 500 documents in Cloud Firestore? Here is how I currently update 500 documents; I am trying to update ms_timestamp for 1,000 documents. Can anyone tell me how to do this with a batched write?
const batch = db.batch();
const campSnapshot = await db
  .collection("camp")
  .where("status", "in", ["PENDING", "CONFIRMED"])
  .get();
await db.collection("camping").doc(getISO8601Date()).set({
  trigger: campSnapshot.docs.length,
});
campSnapshot.forEach((docs) => {
  const object = docs.data();
  object.ms_timestamp = momentTz().tz("Asia/Kolkata").valueOf();
  batch.set(
    db.collection("camp").doc(docs.get("campId")),
    object,
    { merge: true }
  );
});
await Promise.all([batch.commit()]);

Cloud Firestore imposes a limit of 500 documents per transaction or batched write, and you cannot change this, but a workaround may work.
I am not an expert in web dev, so I am sharing a suggestion based on my viewpoint as a mobile app developer.
Create a collection that stores a counter of how many documents a specific collection contains. I update the counter through Cloud Functions (or another approach) whenever a create, update, or delete event fires within that collection. The counter should be atomic and consistent, and you can use a Cloud Firestore transaction for this.
Fetch the counter value before performing the batched writes. This tells me how many documents need to be updated.
Create an offset with an initial value of 0. The offset marks the position in the data. A batched write can only cover up to 500 documents, so to perform another batched write on documents 501-1000 the offset will be 500, and so on.
Call a method that performs batched writes recursively using the defined offset, until the offset reaches counter - 1.
I have not tested this since I don't have enough time right now, but I think it will work.
Leave a comment if anything is still unclear; I'll be glad to help further.
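The offset idea above can also be written as a plain loop over slices of the query result. This is a minimal sketch, assuming `db` is an initialized Firestore instance from the Admin SDK and `docs` is the `docs` array of a query snapshot; the helper names `chunk` and `commitInBatches` are hypothetical.

```javascript
// Split an array into consecutive slices of at most `size` items.
function chunk(items, size) {
  const slices = [];
  for (let i = 0; i < items.length; i += size) {
    slices.push(items.slice(i, i + size));
  }
  return slices;
}

// Merge `data` into every document, committing one batch per slice.
// Assumption: `db.batch()`, `batch.set(ref, data, options)` and
// `batch.commit()` behave as in the Firestore Admin SDK.
async function commitInBatches(db, docs, data, batchSize = 500) {
  let commits = 0;
  for (const slice of chunk(docs, batchSize)) {
    const batch = db.batch();
    for (const doc of slice) {
      batch.set(doc.ref, data, { merge: true });
    }
    await batch.commit(); // wait before starting the next batch
    commits += 1;
  }
  return commits; // number of batches committed
}
```

For the question's case this could be called as `await commitInBatches(db, campSnapshot.docs, { ms_timestamp: momentTz().tz("Asia/Kolkata").valueOf() })`. Each batch is atomic on its own, but the run as a whole is not: if a later batch fails, the earlier ones stay committed.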

Related

Batch requests and concurrent processing

I have a service in NodeJS which fetches user details from DB and sends that to another application via http. There can be millions of user records, so processing this 1 by 1 is very slow. I have implemented concurrent processing for this like this:
const userIds = [1,2,3....];
const users$ = from(this.getUsersFromDB(userIds));
const concurrency = 150;
users$.pipe(
  switchMap((users) =>
    from(users).pipe(
      mergeMap((user) => from(this.publishUser(user)), concurrency),
      toArray()
    )
  )
).subscribe(
  (partialResults: any) => {
    // Do something with partial results.
  },
  (err: any) => {
    // Error
  },
  () => {
    // done.
  }
);
This works perfectly fine for thousands of user records. It processes 150 user records concurrently, which is much faster than publishing users one by one.
But a problem occurs when processing millions of user records: fetching them from the database is slow because the result set grows to gigabytes (which also increases memory usage).
I am looking for a solution that fetches user records from the DB in batches while continuing to publish those records concurrently.
I am thinking of a solution that maintains a queue (of size N) of user records fetched from the DB: whenever the queue size drops below N, fetch the next N results from the DB and add them to the queue.
My current solution would then keep taking records from this queue and processing them with the defined concurrency. But I am not quite able to put this into code. Is there a way we can do this using RxJS?
I think your solution is the right one, i.e. using the concurrent parameter of mergeMap.
The point that I do not understand is why you are adding toArray at the end of the pipe.
toArray buffers all the notifications coming from upstream and will emit only when the upstream completes.
This means that, in your case, the subscribe does not process partial results but processes all of the results you have obtained executing publishUser for all users.
On the contrary, if you remove toArray and leave mergeMap with its concurrent parameter, what you will see is a continuous flow of results into the subscribe due to the concurrency of the process.
That is as far as RxJS is concerned. Beyond that, check whether the specific DB you are using supports batch reads. If it does, you can create buffers of user ids with the bufferCount operator and query the DB once per buffer.
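The batching idea can also be sketched without RxJS operators. The following is a plain JavaScript illustration, not the RxJS pipeline from the question: it takes ids in groups of `batchSize`, loads each group with one query, and publishes the loaded users with at most `concurrency` calls in flight. `fetchUsers` and `publishUser` are hypothetical stand-ins for the question's `getUsersFromDB` and `publishUser`.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}

// Load users batch by batch so only one batch is held in memory at a time.
async function publishInBatches(userIds, batchSize, concurrency, fetchUsers, publishUser) {
  let published = 0;
  for (let i = 0; i < userIds.length; i += batchSize) {
    const ids = userIds.slice(i, i + batchSize);
    const users = await fetchUsers(ids); // one DB query per batch
    await mapWithLimit(users, concurrency, publishUser);
    published += users.length;
  }
  return published;
}
```

This version finishes publishing one batch before fetching the next. It trades some throughput for simplicity and bounded memory; prefetching the next batch while the current one is being published would be a further refinement.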

Mongodb/Mongoose bulkwrite(upsert) performance issues

I am using MongoDB with Mongoose for our Node.js API, where we need to seed collections from a JSON data source. I am using Model.bulkWrite, which internally uses MongoDB's bulkWrite (https://docs.mongodb.com/manual/core/bulk-write-operations).
The code is below:
await Model.bulkWrite(docs.map(doc => ({
updateOne: { ..... } // update document
insertOne: { ....... } // insert document
updateOne: { ..... } // update document
insertOne: { ....... } // insert document
.
.
.n
})))
This works fine for our current use case with just a few hundred documents.
But we are worried about how it will scale and perform when the number of documents increases a lot. For example, will there be any issues when the number of documents reaches tens of thousands?
We just want to confirm whether we are on the right path or whether there is room for improvement.
bulkWrite in MongoDB currently has a maximum of 100,000 write operations in a single batch. From the docs:
The number of operations in each group cannot exceed the value of the maxWriteBatchSize of the database. As of MongoDB 3.6, this value is 100,000. This value is shown in the isMaster.maxWriteBatchSize field.
This limit prevents issues with oversized error messages. If a group
exceeds this limit, the client driver divides the group into smaller
groups with counts less than or equal to the value of the limit. For
example, with the maxWriteBatchSize value of 100,000, if the queue
consists of 200,000 operations, the driver creates 2 groups, each with
100,000 operations.
So you should not run into this limit with tens of thousands of operations, and even beyond it the driver splits the group into smaller ones for you.
For your reference:
Mongodb Bulkwrite: db.collection.bulkWrite()
Write Command Batch Limit Size
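For the upsert-style seeding described in the question, each element passed to bulkWrite is a single operation object. A minimal sketch, assuming every source document carries a unique `_id` to match on (that key and the helper name `toUpsertOps` are assumptions, not taken from the question):

```javascript
// Build one updateOne-with-upsert operation per source document.
function toUpsertOps(docs) {
  return docs.map((doc) => {
    const { _id, ...fields } = doc;
    return {
      updateOne: {
        filter: { _id },
        update: { $set: fields },
        upsert: true, // insert when no document matches the filter
      },
    };
  });
}
```

The result could then be passed as `await Model.bulkWrite(toUpsertOps(docs), { ordered: false })`. With unordered execution, one failing operation does not stop the remaining ones.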

will I hit maximum writes per second per database if I make a document using Promise.all like this?

I am developing an app and want to send a message to every user's inbox. The code in my Cloud Function looks like this:
const query = db.collection(`users`)
  .where("lastActivity","<=",now)
  .where("lastActivity",">=",last30Days)
const usersQuerySnapshot = await query.get()
const promises = []
usersQuerySnapshot.docs.forEach( userSnapshot => {
  const user = userSnapshot.data()
  const userID = user.userID
  // set promise to create data in user inbox
  const p1 = db.doc(`users/${userID}/inbox/${notificationID}`).set(notificationData)
  promises.push(p1)
})
return await Promise.all(promises)
There is a limit in Firebase:
Maximum writes per second per database: 10,000 (up to 10 MiB per second)
Say I send a message to 25k users (creating one document for each of them).
How long will the await Promise.all(promises) operation take? I am worried that it will finish in under 1 second. I don't know whether this code will hit that limit, and I am not sure about its operation rate.
If I do hit that limit, how can I spread the writes out over time? Could you please give me a clue? Sorry, I am a newbie.
If you want to throttle the rate at which document writes happen, you should probably not blindly kick off very large batches of writes in a loop. While there is no guarantee how fast they will occur, it's possible that you could exceed the 10K/second/database limit (depending on how good the client's network connection is, and how fast Firestore responds in general). Over a mobile or web client, I doubt that you'll exceed the limit, but on a backend that's in the same region as your Firestore database, who knows - you would have to benchmark it.
Your client code could simply throttle itself with some simple logic that measures its progress.
If you have a lot of documents to write as fast as possible, and you don't want to throttle your client code, consider throttling them as individual items of work using a Cloud Tasks queue. The queue can be configured to manage the rate at which the queue of tasks will be executed. This will drastically increase the amount of work you have to do to implement all these writes, but it should always stay in a safe range.
You could use e.g. p-limit to reduce promise concurrency in the general case, or preferably use batched writes.
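The self-throttling idea can be sketched as writing in fixed-size groups and pausing between groups. This is a minimal illustration, assuming `writeOne` is any function that returns a promise for a single write (for example a wrapper around `db.doc(path).set(data)`); the names `writeThrottled`, `perGroup` and `pauseMs` are hypothetical.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Start at most `perGroup` writes, wait for them, then pause before the next group.
async function writeThrottled(items, writeOne, perGroup = 500, pauseMs = 1000) {
  let written = 0;
  for (let i = 0; i < items.length; i += perGroup) {
    const group = items.slice(i, i + perGroup);
    await Promise.all(group.map((item) => writeOne(item)));
    written += group.length;
    if (i + perGroup < items.length) {
      await sleep(pauseMs); // spread the remaining writes over time
    }
  }
  return written;
}
```

With the defaults, 25,000 writes are issued as 50 groups of 500 with a pause between groups, which stays well below 10,000 writes per second at the cost of a longer total run time. The right group size and pause depend on your own benchmarks.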

Node: Check a Firebase db and execute a function when an objects time matches the current time

Background
I have a Node and React based application. I'm using Firebase for my storage and database. In my application users can fill out a form where they upload an image and select a time for the image to be added to their website. I save each image update as an object in my Firebase database like so. Images are arranged in order of ascending update time.
user-name: {
  images: [
    {
      src: 'image-src-url',
      updateTime: 1503953587727
    },
    {
      src: 'image-src-url',
      updateTime: 1503958424838
    }
  ]
}
Scale
My applications db could potentially get very large with a lot of users and images. I'd like to ensure scalability.
Issue
How do I check when a specific image object's time has been reached and then execute a function? (I do not need assistance with the actual function being run, just with checking the db for a specific time.)
Attempts
I've thought about running a cron job using node-cron that checks the entire database every 60s (users can only specify the minute the image will update, not the seconds). If it finds a matching updateTime, it executes my function. My concern is that at a large scale the cron job will take a while to search the db and could miss a time.
I've also thought about dynamically creating a specific cron job for that time whenever the user schedules a new update. I'm unsure how to accomplish this.
Any other methods that may work? Are my concerns about node-cron not valid?
There are two approaches I can think of:
Keep track of the last timestamp you processed
Keep the "things to process" in a queue
Keep track of the last timestamp you processed
When you process items, you use the current timestamp as the cut-off point for your query. Something like:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
Now make sure to store this now somewhere (i.e. in your database) so that you can re-use it next time to retrieve the next batch of items:
var previous = ... previous value of now
var now = Date.now();
var query = ref.orderByChild("updateTime").startAt(previous).endAt(now);
With this you're only processing a single slice at a time. The only tricky bit is that somebody might insert a new node with an updateTime that you've already processed. If this is a concern for your use-case, you can prevent them from doing so with a validation rule on updateTime:
".validate": "newData.val() >= root.child('lastProcessed').val()"
As you add more items to the database, you will indeed be querying more items. So there is a scalability limit to this approach, but this approach should work well for anything up to a few hundreds of thousands of nodes (I haven't tested in a while so ymmv).
For a few previous questions on list size:
Firebase Performance: How many children per node?
Firebase Scalability Limit
How many records / rows / nodes is alot in firebase?
Keep the "things to process" in a queue
An alternative approach is to keep a queue of items that still need to be processed. The clients add the items that they want processed to the queue, with an updateTime of when they want each item to be processed. Your server then picks the items from the queue, performs the necessary updates, and removes each item from the queue:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
query.once("value").then(function(snapshot) {
  snapshot.forEach(function(child) {
    // TODO: process the child node
    // remove the child node from the queue
    child.ref.remove();
  });
})
The difference with the earlier approach is that a queue's stable state is going to be empty (or at least quite small), so your queries will run against a much smaller list. That's also why you won't need to keep track of the last timestamp you processed: any item in the queue up to now is eligible for processing.

Request rate is large for remove request

When I try to find some docs in documentDB, all is good -
collection.find({query})
when I try to remove, all is bad
[mongooseModel | collection].remove({same-query})
I got
Request rate is large
The number of documents to remove is about 10,000. I have tested the queries in the Robomongo shell, which limits find results to 50 per page. My remove query also fails with Mongoose. I can't understand this behavior. How can I hit the request limit when the remove query is a single request?
Update
A count with the same query also raises the same error.
db.getCollection('taxonomies').count({query})
"Request rate is large" indicates that the application has exceeded the provisioned RU quota, and should retry the request after a small time interval.
Since you are using the DocumentDB Node.js API, you could check out Larry Maccherone's answer in "Request rate is large" on how to avoid this issue by handling retry behavior and logic in your application's error handling routines.
More on this:
Dealing with RequestRateTooLarge errors in Azure DocumentDB and testing performance
Request Units in Azure Cosmos DB
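The retry behavior described above can be sketched as a small wrapper that re-runs an operation after a growing delay. This is a generic illustration, not the DocumentDB SDK's own retry policy. `isRateLimited` is a hypothetical predicate that you would adapt to the error your driver actually throws; the message check below is an assumption based on the error text in the question.

```javascript
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumption: the driver surfaces throttling with this text in the error message.
const isRateLimited = (err) => /Request rate is large/i.test(String(err && err.message));

// Retry `operation` with exponential backoff while it is being throttled.
async function withRetry(operation, maxAttempts = 5, baseDelayMs = 100) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up on unrelated errors or once the attempt budget is spent.
      if (!isRateLimited(err) || attempt >= maxAttempts) throw err;
      await wait(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

A call might look like `await withRetry(() => Model.deleteMany(query))`. Capping the number of attempts keeps a persistently throttled request from looping forever.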
This is an awful part of CosmosDB - I don't see how they intend to become a real player in the cloud database game with limitations like this. That being said - I came up with a hack to delete all records from a collection through the MongoDB API when you are bumping up against the "Rate Exceeded" error. See below:
var loop = true
while(loop){
  try {
    db.grid.deleteMany({"myField":"my-query"})
  }
  catch (err) {
    print(err)
    printjson(db.runCommand({getLastRequestStatistics:1}))
    continue
  }
  loop = false
}
Just update the deleteMany query and run this script directly in your mongo shell and it should loop through and delete the records.
