How to perform massive data uploads to Firebase Firestore - Node.js

I have about 300 MB of data (~180k JSON objects) that gets updated once every 2-3 days.
This data is divided into three "collections" that I must keep up to date.
I decided to go the Node.js way, but a solution in any language I know (Java, Python) is welcome.
Whenever I perform a batch set using the Node.js firebase-admin client, not only does it consume an aberrant amount of RAM (about 4-6 GB!), but it also tends to crash with errors that have no clear cause (I got to page 4 of the Google search results without finding a meaningful answer).
My code is frankly simple; this is it:
var collection = db.collection("items");
var batch = db.batch();
array.forEach(item => {
    var ref = collection.doc(item.id);
    batch.set(ref, item);
});
batch.commit().then((res) => {
    console.log("YAY", res);
});
I haven't found anywhere whether there is a limit on the number of writes in a given span of time (I assume 50-60k writes should be easy for a backend the size of Firebase), and I also found that memory usage can climb until 4-6 GB of RAM is allocated.
I can confirm that when the errors are thrown or the RAM usage clogs my laptop, whichever happens first, I am still below 1-4% of my daily usage quotas, so that is not the issue.
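One direction worth trying, offered as a sketch rather than a confirmed fix: Firestore's documentation has listed a cap of 500 operations per batched write, so putting every `set` call into a single batch is a plausible source of both the errors and the memory growth. The example below assumes that cap still applies (check the current limits), that `db` is an initialized firebase-admin Firestore instance, and that every item has a string `id`; `chunk` and `uploadInBatches` are names introduced here for illustration.

```javascript
// Assumed per-batch operation cap; verify against the current Firestore limits.
const BATCH_LIMIT = 500;

// Split `items` into consecutive slices of at most `size` elements.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Commit one small batch at a time so only one batch is held in memory.
// `db` is assumed to expose collection(), batch(), set() and commit()
// the way the firebase-admin Firestore client does.
async function uploadInBatches(db, collectionName, items, size = BATCH_LIMIT) {
  const collection = db.collection(collectionName);
  let written = 0;
  for (const group of chunk(items, size)) {
    const batch = db.batch();
    for (const item of group) {
      batch.set(collection.doc(item.id), item);
    }
    await batch.commit(); // wait for this batch before building the next one
    written += group.length;
  }
  return written;
}
```

It would be called as `uploadInBatches(db, "items", array)`. Committing sequentially trades speed for a flat memory profile; running a few batches in parallel is possible but was left out to keep the sketch small.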


Some trivial transactions take dozens of seconds to complete on Spanner microinstance

Here is some context.
A Node.js server, connecting to Cloud Spanner from a development machine.
Most of the time the queries take about 200-400 ms, including data transfer from the servers' location to my dev machine.
But sometimes these trivial transactions take 12-16 seconds, which is surely not acceptable for the use case: session storage for a backend server.
In the local dev context the sessions service runs on the same machine as the main backend; in staging and prod they run in the same Kubernetes cluster.
This is not about the amount of data: our staging Spanner database holds very little overall, a few MB across all tables and only about 10 rows in the table in question.
Spanner instance stats:
Processing units: 100
CPU utilization: 4.3% for the staging database and 10% overall for instance.
The table looks like this (a few other small fields omitted):
CREATE TABLE sessions
(
    id STRING(255) NOT NULL,
    created TIMESTAMP,
    updated TIMESTAMP,
    status STRING(16),
    is_local BOOL,
    user_id STRING(255),
    anonymous BOOL,
    expires_at TIMESTAMP,
    last_activity_at TIMESTAMP,
    json_data STRING(MAX),
) PRIMARY KEY(id);
The transaction in question runs a single statement like this:
UPDATE ${schema.reportsTable}
SET ${statusCol.columnName} = @status_recycled
WHERE ${idCol.columnName} = @id_value
AND ${statusCol.columnName} = @status_active
with parameters like this:
{
    "id_value": "some_session_id",
    "status_active": "active",
    "status_recycled": "recycled"
}
Yes, a STRING(16) status field with readable names instead of a boolean field is not ideal, I know, but this concept is inherited from older code. What concerns me is that we do not have much data there yet, just 10 rows or so, so experiencing this sort of delay is surely unexpected at this scale.
Okay, I understand I am on the other side of the globe from the Spanner servers, but this usually gives delays of 200-1200 ms, not 12-16 seconds.
The delay happens quite rarely and randomly, but it seems to happen on queries like this.
The delay comes at commit, not at, e.g., sending the SQL command itself or obtaining a transaction.
I tried a different query first, like
DELETE FROM Sessions WHERE id = @id_value
and it was the same: a rare, random delay of 12-16 seconds for such a trivial query.
Thanks a lot for your help and time.
PS: Update: actually this 12-16 second delay can happen on any random transaction in the described context, and all of these transactions are standard single-row CRUD operations.
Update 2:
The code that sends the transaction is our own wrapper over the standard @google-cloud/spanner client library for Node.js.
The wrapper library just gives easy-to-use wrapping around the Spanner instance, database, and transaction.
The Spanner instance and database objects are long-lived singletons; they are not recreated from scratch for every transaction.
The main purpose of that wrapper is to provide logic like:
let result = await useDataContext(async(ctx) => {
    let sql = await ctx.getSQLRunner();
    return await sql.runSQLUpdate({
        sql: `Some SQL Trivial Statement`,
        parameters: {
            param1: 1,
            param2: true,
            param3: "some string"
        }
    });
});
The purpose of that is to give some guarantees: if changes were made to the data, transaction.commit will surely be called; if no changes were made, transaction.end will be called; and if an error occurs in the called code, such as invalid generated SQL or a variable being undefined or null, a transaction rollback will be initiated.
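For readers trying to picture the wrapper, the commit/end/rollback rule described above could look roughly like the sketch below. This is not the poster's code: `withTransaction`, `markChanged`, and the shape of `tx` (async `commit()` and `rollback()`, sync `end()`) are hypothetical stand-ins chosen for illustration.

```javascript
// Hypothetical sketch of the commit / end / rollback rule described above.
// `tx` stands in for a transaction object; its exact shape is an assumption.
async function withTransaction(tx, work) {
  let changed = false;
  const ctx = {
    tx,
    // The callback calls this after any statement that modifies data.
    markChanged: () => { changed = true; },
  };

  let result;
  try {
    result = await work(ctx);
  } catch (err) {
    await tx.rollback(); // an error in the callback undoes the work
    throw err;
  }

  if (changed) {
    await tx.commit();   // data was modified: persist it
  } else {
    tx.end();            // nothing to persist: just release the transaction
  }
  return result;
}
```

Wrapping the `tx.commit()` call with a timer in a structure like this is one way to confirm that the 12-16 second delay really sits in the commit step rather than in the callback.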

Batch requests and concurrent processing

I have a Node.js service which fetches user details from a DB and sends them to another application via HTTP. There can be millions of user records, so processing them one by one is very slow. I have implemented concurrent processing like this:
const userIds = [1,2,3....];
const users$ = from(this.getUsersFromDB(userIds));
const concurrency = 150;
users$.pipe(
    switchMap((users) =>
        from(users).pipe(
            mergeMap((user) => from(this.publishUser(user)), concurrency),
            toArray()
        )
    )
).subscribe(
    (partialResults: any) => {
        // Do something with partial results.
    },
    (err: any) => {
        // Error
    },
    () => {
        // done.
    }
);
This works perfectly fine for thousands of user records: it processes 150 user records concurrently, which is much faster than publishing users one by one.
But a problem occurs when processing millions of user records: getting them from the database is pretty slow, as the result set grows to GBs (which also means more memory usage).
I am looking for a solution that gets user records from the DB in batches while continuing to publish those records concurrently.
I am thinking of a solution like this: maintain a queue (of size N) of user records fetched from the DB, and whenever the queue size drops below N, fetch the next N results from the DB and add them to the queue.
Then my current solution would keep taking records from this queue and processing them concurrently with the defined concurrency. But I am not quite able to put this into code. Is there a way to do this using RxJS?
I think your solution is the right one, i.e. using the concurrent parameter of mergeMap.
The point that I do not understand is why you are adding toArray at the end of the pipe.
toArray buffers all the notifications coming from upstream and will emit only when the upstream completes.
This means that, in your case, the subscribe does not process partial results but processes all of the results you have obtained executing publishUser for all users.
On the contrary, if you remove toArray and leave mergeMap with its concurrent parameter, what you will see is a continuous flow of results into the subscribe due to the concurrency of the process.
That is as far as RxJS is concerned. Then you can look at the specific DB you are using to see whether it supports batch reads. If it does, you can create buffers of user ids with the bufferCount operator and query the DB with those buffers.
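The batching idea can be sketched without committing to a particular DB. The example below is plain Node.js rather than RxJS, so it is a swapped-in illustration of the same shape: `bufferIds` plays the role of bufferCount, and `mapWithConcurrency` plays the role of mergeMap with its concurrent parameter. All function names here are hypothetical, and `fetchUsers` and `publishUser` stand in for the poster's getUsersFromDB and publishUser.

```javascript
// Split ids into consecutive buffers of `size`, like RxJS bufferCount(size).
function bufferIds(ids, size) {
  const buffers = [];
  for (let i = 0; i < ids.length; i += size) {
    buffers.push(ids.slice(i, i + size));
  }
  return buffers;
}

// Run `worker` over `items` with at most `limit` calls in flight,
// similar in spirit to mergeMap(worker, limit). Results keep input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const index = next++; // claim the next item before awaiting
      results[index] = await worker(items[index]);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Fetch one buffer of users at a time, publish it with bounded concurrency,
// and hand each partial result to `onBatchDone` before fetching the next one.
async function publishInBatches(userIds, options) {
  const { batchSize, concurrency, fetchUsers, publishUser, onBatchDone } = options;
  for (const ids of bufferIds(userIds, batchSize)) {
    const users = await fetchUsers(ids);
    const published = await mapWithConcurrency(users, concurrency, publishUser);
    onBatchDone(published);
  }
}
```

Only one buffer of users is in memory at a time, which addresses the GB-sized result set. The sketch does not prefetch the next buffer while the current one is being published; that overlap is what the queue described in the question would add.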

AWS DocumentDB Performance Issue with Concurrency of Aggregations

I'm working with DocumentDB in AWS, and I've been having trouble when I read from the same collection simultaneously with different aggregation queries.
The issue is not that I cannot read from the database, but rather that the queries take a long time to complete. It doesn't matter whether I trigger the queries simultaneously or one after the other.
I'm using a Lambda function with Node.js to run my code, and mongoose to handle the connection with the database.
Here's some sample code that I put together to illustrate my problem:
function query1() {
    return Collection.aggregate([...])
}
function query2() {
    return Collection.aggregate([...])
}
function query3() {
    return Collection.aggregate([...])
}
It takes the same time if I run them using Promise.all
Promise.all([ query1(), query2(), query3() ])
as it does if I run each one after waiting for the previous one to finish
query1().then(result1 => query2().then(result2 => query3()))
However, if I run each query in a separate Lambda execution, each individual query takes significantly less time to finish (between 1 and 2 seconds).
So if they were running in parallel, the whole execution should finish in the time of the slowest query (2 seconds), not in 7 seconds as it does now.
So my guess is that the DocumentDB instance is running the queries in sequence no matter how I send them. The collection holds around 19,000 documents with a total size of almost 25 MB.
When I check the metrics of the instance, CPUUtilization is barely over 8% and the available RAM only drops by 20 MB. So I don't think the delay has to do with the size of the instance.
Do you know why DocumentDB is behaving like this? Is there a configuration that I can change to run the aggregations in parallel?
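Before changing any configuration, it may help to measure each query individually in both modes within one invocation. If a query that takes about 2 seconds alone takes noticeably longer inside Promise.all, something between the client and the database is serializing the work. The sketch below is only a measurement aid; `timed`, `compareModes`, and `fakeQuery` are hypothetical names, and `fakeQuery` uses a timer in place of a real aggregation.

```javascript
// Measure how long a promise-returning function takes, in milliseconds.
async function timed(label, fn) {
  const start = process.hrtime.bigint();
  const result = await fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  return { label, ms, result };
}

// Run the same queries one after another, then all at once, and report
// the per-query and total timings for each mode.
// `queries` maps a label to a function that returns a promise.
async function compareModes(queries) {
  const entries = Object.entries(queries);

  const sequential = [];
  const seqRun = await timed("sequential", async () => {
    for (const [label, fn] of entries) {
      sequential.push(await timed(label, fn));
    }
  });

  let parallel = [];
  const parRun = await timed("parallel", async () => {
    parallel = await Promise.all(entries.map(([label, fn]) => timed(label, fn)));
  });

  return { sequential, sequentialMs: seqRun.ms, parallel, parallelMs: parRun.ms };
}

// Stand-in for an aggregation: resolves after `ms` milliseconds.
const fakeQuery = (ms) => () => new Promise((resolve) => setTimeout(resolve, ms));
```

With the real functions it would be called as `compareModes({ query1, query2, query3 }).then(console.log)`.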

Will I hit the maximum writes per second per database if I create documents using Promise.all like this?

I am now developing an app, and I want to send a message to every user's inbox. The code in my Cloud Functions looks like this:
const query = db.collection(`users`)
    .where("lastActivity","<=",now)
    .where("lastActivity",">=",last30Days)
const usersQuerySnapshot = await query.get()
const promises = []
usersQuerySnapshot.docs.forEach( userSnapshot => {
    const user = userSnapshot.data()
    const userID = user.userID
    // set promise to create data in user inbox
    const p1 = db.doc(`users/${userID}/inbox/${notificationID}`).set(notificationData)
    promises.push(p1)
})
return await Promise.all(promises)
There is a limit in Firebase:
Maximum writes per second per database: 10,000 (up to 10 MiB per second)
Say I send a message to 25k users (that is, create a document for each of 25k users).
How long will that await Promise.all(promises) take? I am worried that the operation will take less than 1 second, and I don't know whether it will hit that limit with this code. I am not sure about the operation rate of this.
If I hit that limit, how do I spread the writes out over time? Could you please give me a clue? Sorry, I am a newbie.
If you want to throttle the rate at which document writes happen, you should probably not blindly kick off very large batches of writes in a loop. While there is no guarantee how fast they will occur, it's possible that you could exceed the 10K/second/database limit (depending on how good the client's network connection is, and how fast Firestore responds in general). Over a mobile or web client, I doubt that you'll exceed the limit, but on a backend that's in the same region as your Firestore database, who knows - you would have to benchmark it.
Your client code could simply throttle itself with some simple logic that measures its progress.
If you have a lot of documents to write as fast as possible, and you don't want to throttle your client code, consider throttling them as individual items of work using a Cloud Tasks queue. The queue can be configured to manage the rate at which the queue of tasks will be executed. This will drastically increase the amount of work you have to do to implement all these writes, but it should always stay in a safe range.
You could use e.g. p-limit to reduce promise concurrency in the general case, or preferably use batched writes.
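The "simple logic that measures its progress" mentioned above could be sketched as follows. It is only an illustration: `writeThrottled`, `maxPerSecond`, and the `writeOne` callback are hypothetical names, and starting at most one slice per time window is a deliberately conservative choice, not something Firestore prescribes.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Write `items` in slices of at most `maxPerSecond`, starting at most one
// slice per `windowMs`. `writeOne(item)` must return a promise; in the
// question's code it would wrap the db.doc(...).set(...) call.
async function writeThrottled(items, writeOne, maxPerSecond, windowMs = 1000) {
  let written = 0;
  for (let i = 0; i < items.length; i += maxPerSecond) {
    const started = Date.now();
    const slice = items.slice(i, i + maxPerSecond);
    await Promise.all(slice.map((item) => writeOne(item)));
    written += slice.length;

    const elapsed = Date.now() - started;
    const moreToDo = i + maxPerSecond < items.length;
    if (moreToDo && elapsed < windowMs) {
      await sleep(windowMs - elapsed); // wait out the rest of the window
    }
  }
  return written;
}
```

For 25k users and a cap of, say, 5,000 writes per window, this would take at least about four seconds longer than an unthrottled Promise.all, in exchange for staying under the stated limit.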

Alternatives to MongoDB cursor.toArray() in node.js

I am currently using MongoDB cursor's toArray() function to convert the database results into an array:
run = true;
count = 0;
var start = process.hrtime();
db.collection.find({}, {limit: 2000}).toArray(function(err, docs){
    var diff = process.hrtime(start);
    run = false;
    socket.emit('result', {
        result: docs,
        time: diff[0] * 1000 + diff[1] / 1000000,
        ticks: count
    });
    if(err) console.log(err);
});
This operation takes about 7ms on my computer. If I remove the .toArray() function then the operation takes about 0.15ms. Of course this won't work because I need to forward the data, but I'm wondering what the function is doing since it takes so long? Each document in the database simply consists of 4 numbers.
In the end I'm hoping to run this on a much smaller processor, like a Raspberry Pi, where fetching 500 documents from the database and converting them to an array takes about 230ms. That seems like a lot to me. Or am I just expecting too much?
Are there any alternative ways to get data from the database without using toArray()?
Another thing that I noticed is that the entire Node application slows down remarkably while getting the database results. I created a simple interval function that should increment the count value every 1 ms:
setInterval(function(){
    if(run) count++;
}, 1);
I would then expect the count value to be almost the same as the time, but for a time of 16 ms on my computer the count value was 3 or 4. On the Raspberry Pi the count value was never incremented. What is taking so much CPU usage? The monitor told me that my computer was using 27% CPU and the Raspberry Pi was using 92% CPU and 11% RAM, when asked to run the database query repeatedly.
I know that was a lot of questions. Any help or explanations are much appreciated. I'm still new to Node and MongoDB.
db.collection.find() returns a cursor, not results, and opening a cursor is pretty fast.
Once you start reading the cursor (using .toArray() or by traversing it using .each() or .next()), the actual documents are being transferred from the database to your client. That operation is taking up most of the time.
I doubt that using .each()/.next() (instead of .toArray(), which—under the hood—uses one of those two) will improve the performance much, but you could always try (who knows). Since .toArray() will read everything in memory, it may be worthwhile, although it doesn't sound like your data set is that large.
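To try the cursor-traversal alternative, something like the sketch below could be used. It assumes a driver version whose cursors are async iterable (recent versions of the official MongoDB Node.js driver are); `forEachDoc` is a hypothetical helper, and `fakeCursor` is an async generator that stands in for a real cursor so the sketch runs without a database.

```javascript
// Consume any async-iterable source (such as a MongoDB cursor) one document
// at a time instead of materializing everything with toArray().
async function forEachDoc(cursor, onDoc) {
  let count = 0;
  for await (const doc of cursor) {
    onDoc(doc); // for example, forward each document over the socket here
    count++;
  }
  return count;
}

// Stand-in for a cursor: yields `n` documents of four numbers each.
async function* fakeCursor(n) {
  for (let i = 0; i < n; i++) {
    yield { a: i, b: i * 2, c: i * 3, d: i * 4 };
  }
}
```

With a real cursor the call would be `forEachDoc(cursor, handler)`, where `cursor` is whatever find() returns. The total transfer time is unlikely to change much, but documents can be forwarded as they arrive rather than after the whole result set is in memory.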
I really think that MongoDB on Raspberry Pi (esp a Model 1) is not going to work well. If you don't depend on the MongoDB query features too much, you should consider using an alternative data store. Perhaps even an in-memory storage (500 documents times 4 numbers doesn't sound like lots of RAM is required).
