I am using MongoDB with Mongoose for our Node.js API, where we need to do a sort of seed for collections whose data source is a JSON file. I am using Model.bulkWrite, which internally uses MongoDB's bulkWrite (https://docs.mongodb.com/manual/core/bulk-write-operations).
Code below:
await Model.bulkWrite(docs.map(doc => ({
  // one operation per document, e.g.
  updateOne: { ..... },   // update document
  // or
  insertOne: { ....... }, // insert document
  // ... n operations in total
})))
This works fine for our current use case with just a few hundred documents, but we are worried about how it will scale and how it will perform when the number of documents grows a lot. For example, will there be any issues when the number of documents is in the tens of thousands?
I just want to confirm whether we are on the right path, or whether there is room for improvement.
bulkWrite in MongoDB currently has a maximum limit of 100,000 write operations in a single batch. From the docs:
The number of operations in each group cannot exceed the value of the maxWriteBatchSize of the database. As of MongoDB 3.6, this value is 100,000. This value is shown in the isMaster.maxWriteBatchSize field.
This limit prevents issues with oversized error messages. If a group
exceeds this limit, the client driver divides the group into smaller
groups with counts less than or equal to the value of the limit. For
example, with the maxWriteBatchSize value of 100,000, if the queue
consists of 200,000 operations, the driver creates 2 groups, each with
100,000 operations.
So for tens of thousands of documents you won't face any issues with this approach; even if you exceed the limit, the driver splits the operations into smaller groups for you.
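That said, if you ever want tighter control over memory while seeding, you can chunk the operations yourself and issue one bulkWrite per chunk with ordered: false. A minimal sketch, assuming Model is your Mongoose model and docs is the parsed JSON array (CHUNK_SIZE, buildOp and seed are illustrative names, not from your code):
const CHUNK_SIZE = 10000; // illustrative; anything well under 100,000 works

function buildOp(doc) {
  // decide per document whether it is an insert or an update,
  // exactly as in your current map callback
  return { insertOne: { document: doc } };
}

async function seed(Model, docs) {
  for (let i = 0; i < docs.length; i += CHUNK_SIZE) {
    const ops = docs.slice(i, i + CHUNK_SIZE).map(buildOp);
    // ordered: false lets MongoDB continue past individual failures
    // and often executes the batch faster
    await Model.bulkWrite(ops, { ordered: false });
  }
}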
For your reference:
MongoDB bulkWrite: db.collection.bulkWrite()
Write Command Batch Limit Size
I need to get 300k records with Node.js from the database and pass them to the client (HTML side) for reporting and virtual scrolling. I started with MySQL and it was taking around 14 seconds (2 tables joined). Since I'd like to speed it up, I switched to MongoDB, but it takes around 31.2 seconds on a single collection, without any joins, using the native MongoDB npm package.
Node.js MongoDB source code:
const { MongoClient } = require('mongodb');
// ... client/connection setup omitted ...
const db = client.db(dbName);
const collection = db.collection('test');

const startTime = performance.now();
const findResult = await collection.find({}).toArray();
const endTime = performance.now();

console.log(`Time elapsed: ${endTime - startTime} milliseconds`);
console.log('Found documents 1st element =>', findResult[0]);
I assumed MongoDB would take less time than MySQL, so I guess there is something wrong with my code (I also tried fastify-mongodb, fast-json-stringify, and streaming, but none of them sped it up). What do you suggest? I researched a lot of posts and most people fetch 3M+ records in around 10 seconds, and I know MongoDB is a good choice for big data, even if my data is not that big :)
In general, MongoDB is good at many small, single-document reads driven by indexes (already in memory). This comes from the fact that data is allocated on storage in random 32k compressed blocks rather than written sequentially, so when you read the full collection there are many IOPS against your storage, and that turns out to be the bottleneck for sequential reads in your case. If you immediately read the same collection a second time, you will find the read is much faster, since the documents are already cached in memory and in the OS filesystem cache. To improve this for bigger volumes you may shard the collection and distribute the data across multiple shards, so each individual shard uses fewer IOPS when fetching the full collection.
By the way, by "300k data" do you mean 300k documents, or a total collection size of 300k bytes? 31 seconds is a huge amount of time; maybe the documents in this 300k collection are relatively big, or you have some infrastructure issue?
(300,000 docs × 16 MB max document size) / (1024 × 1024) ≈ 4.5 TB max. How big is your collection?
What filesystem and storage are you on?
Testing on my machine, it seems the issue is not in MongoDB itself but in the stages after you fetch the data:
db.collection.find({}).explain("executionStats")

"executionStats" : {
    "nReturned" : 300000,
    "executionTimeMillis" : 174,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 300000,
    ...
}
Some interesting results:
mongos> var a = new Date().getTime();
        var doc = db.collection.find({});
        var b = new Date().getTime();
        var doc2 = doc.toArray();
        var c = new Date().getTime();
        printjson(doc2);
        var d = new Date().getTime();
        print("find: " + (b - a) + "ms, toArray: " + (c - b) + "ms, printjson(doc2): " + (d - c) + "ms");
doc=find({}): 0ms
doc2=doc.toArray(): 3296ms
printjson(doc2): 67122ms
Some things you may try:
Increase the cursor batch size:
var doc = db.collection.find({}).batchSize(3000000)
(In my tests, higher batch sizes gave 2-3 times better execution times.)
Use eachAsync parallelized, check here.
(I haven't tested that myself, but you may try it; see the sketch below.)
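If you can use Mongoose on top of the driver, a minimal sketch of the eachAsync idea (the Test model name, batch size, parallel value and handleDoc handler are placeholders, not from your code):
const cursor = Test.find({}).lean().cursor({ batchSize: 10000 });

await cursor.eachAsync(
  async (doc) => {
    // process / serialize / forward each document as it streams in,
    // instead of materializing the whole collection with toArray()
    await handleDoc(doc); // handleDoc is a hypothetical handler of yours
  },
  { parallel: 8 } // handle up to 8 documents at a time
);
This streams documents instead of building one giant array, which keeps memory flat while you serialize or forward the results.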
I have a service in Node.js which fetches user details from the DB and sends them to another application via HTTP. There can be millions of user records, so processing them one by one is very slow. I have implemented concurrent processing like this:
const userIds = [1,2,3....];
const users$ = from(this.getUsersFromDB(userIds));
const concurrency = 150;
users$.pipe(
switchMap((users) =>
from(users).pipe(
mergeMap((user) => from(this.publishUser(user)), concurrency),
toArray()
)
)
).subscribe(
(partialResults: any) => {
// Do something with partial results.
},
(err: any) => {
// Error
},
() => {
// done.
}
);
This works perfectly fine for thousands of user records: it processes 150 user records concurrently at a time, which is much faster than publishing users one by one.
But the problem occurs when processing millions of user records; getting those from the database is pretty slow, as the result set size grows to GBs (and memory usage along with it).
I am looking for a solution to get user records from the DB in batches, while continuing to publish those records concurrently.
I am thinking of a solution like this: maintain a queue (of size N) of user records fetched from the DB, and whenever the queue size drops below N, fetch the next N results from the DB and add them to the queue.
Then my current solution would keep taking records from this queue and processing them concurrently with the defined concurrency. But I am not quite able to put this into code. Is there a way we can do this using RxJS?
I think your solution is the right one, i.e. using the concurrent parameter of mergeMap.
The point that I do not understand is why you are adding toArray at the end of the pipe.
toArray buffers all the notifications coming from upstream and will emit only when the upstream completes.
This means that, in your case, the subscribe does not process partial results but processes all of the results you have obtained executing publishUser for all users.
On the contrary, if you remove toArray and leave mergeMap with its concurrent parameter, what you will see is a continuous flow of results into the subscribe due to the concurrency of the process.
That is as far as RxJS is concerned. Then you can look at the specific DB you are using to see whether it supports batch reads, in which case you can create buffers of user ids with the bufferCount operator and query the DB with those buffers.
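Putting those two points together, a rough sketch of what the pipe could look like (getUsersFromDB and publishUser are your own methods; the batch size and concurrency values are only examples):
const { from } = require('rxjs');
const { bufferCount, mergeMap } = require('rxjs/operators');

const batchSize = 1000;   // how many ids to fetch per DB round trip
const concurrency = 150;  // how many publishUser calls in flight at once

from(userIds).pipe(
  bufferCount(batchSize),                          // emit ids in chunks of batchSize
  mergeMap((ids) => this.getUsersFromDB(ids), 1),  // fetch one batch at a time
  mergeMap((users) => from(users)),                // flatten each batch into single users
  mergeMap((user) => from(this.publishUser(user)), concurrency)
).subscribe({
  next: (result) => { /* one notification per published user, as they complete */ },
  error: (err) => { /* handle error */ },
  complete: () => { /* done */ }
});
This keeps a single DB query in flight at a time, caps publishing at the chosen concurrency, and delivers results continuously instead of all at once at the end.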
For example, we have bank records and we use a query to get all of the bank's records. I just want to create a function that simply returns the total number of bank records, and returns a number only.
Do you mean the total number of records in CouchDB or just a particular type of record?
Anyhow, I'll propose solutions for both, assuming you're using CouchDB as your state DB.
Reading the total number of records present in CouchDB from chaincode will just be a big overhead. You can simply make a GET API call like this: http://couchdb.server.com/mydatabase, and you'd get JSON back looking something like this:
{
"db_name":"mydatabase",
"update_seq":"2786-g1AAAAFreJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8riYGB0RuPuiQFIJlkD1Naik-pA0hpPExpDj6lCSCl9TClwXiU5rEASYYGIAVUPR-sPJqg8gUQ5fvBygMIKj8AUX4frDyOoPIHEOUQt0dlAQB32XIg",
"sizes":{
"file":13407816,
"external":3760750,
"active":4059261
},
"purge_seq":0,
"other": {
"data_size":3760750
},
"doc_del_count":0,
"doc_count":2786,
"disk_size":13407816,
"disk_format_version":6,
"data_size":4059261,
"compact_running":false,
"instance_start_time":"0"
}
From here, you can simply read the doc_count value.
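For example, from Node.js that could be as simple as this (the URL is the placeholder database from above; fetch is global in Node 18+, otherwise use any HTTP client):
// Sketch: read the CouchDB database info document and pull out doc_count.
const res = await fetch('http://couchdb.server.com/mydatabase');
const info = await res.json();
console.log('Total documents:', info.doc_count);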
However, if you want to read the total number of docs in chaincode, then I should mention that it'll be a very costly operation and you might get a timeout error if the number of records is very high. For a particular type of record, you can use CouchDB selector syntax.
If you want to count all the records, then you can use the getStateByRange(startKey, endKey) method and count them, as sketched below.
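For completeness, a rough sketch of that counting approach with the Node.js chaincode API (countAllRecords is an illustrative name; as said above, this can be very slow or time out on large datasets):
// Sketch: count every key in this chaincode's world state.
// ctx is the transaction context passed into a fabric-contract-api method.
async function countAllRecords(ctx) {
  const iterator = await ctx.stub.getStateByRange('', ''); // empty keys = full range
  let count = 0;
  let result = await iterator.next();
  while (!result.done) {
    count++;
    result = await iterator.next();
  }
  await iterator.close();
  return count;
}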
I am trying to update more than 500 documents (1,000 to 2,000), and I only know how to use a batch for 500 documents. I want to ask how I can update more than 500 documents using Cloud Firestore. Here is how I am currently updating up to 500 documents: I am trying to update ms_timestamp for 1,000 documents. Can anyone tell me how I can do it using a batched write?
const batch = db.batch();
const campSnapshot = await db
.collection("camp")
.where("status", "in", ["PENDING", "CONFIRMED"])
.get();
await db.collection("camping").doc(getISO8601Date()).set({
trigger: campSnapshot.docs.length,
});
campSnapshot.forEach((docs) => {
const object = docs.data();
object.ms_timestamp = momentTz().tz("Asia/Kolkata").valueOf();
batch.set(
db.collection("camp").doc(docs.get("campId")),
object,
{ merge: true }
);
});
await Promise.all([batch.commit()]);
Cloud Firestore imposes a limit of 500 documents when performing a transaction or batched write, and you cannot change this, but a workaround may well work.
I am not an expert in web dev, so I am sharing a suggestion based on my viewpoint as a mobile app developer.
Create a collection that stores a counter of how many documents are contained within a specific collection. Update the counter through Cloud Functions (or another approach) whenever an event (created, updated, or deleted) fires within that specific collection. The counter should be atomic and consistent, and you can leverage a Cloud Firestore transaction here.
Fetch the counter value before performing batched writes. This tells you how many data/objects/documents need to be updated.
Create an offset with an initial value of 0. The offset is used to mark the data. A batched write can only cover up to 500 documents, so if you want to perform another batched write on documents 501-1000, the offset will be 500, and so on.
Call a method that performs batched writes recursively using the defined offset, until the offset covers all documents (i.e. reaches counter - 1).
I have not tested this since I don't have enough time right now, but I think it'll work.
You can comment if you still don't understand; I'll be glad to help further.
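If it helps, here is a rough sketch of the simplest form of that idea, building on the code from your question: split the matched documents into chunks of at most 500 and commit one batch per chunk.
const batchLimit = 500; // Firestore's maximum writes per batch
const docs = campSnapshot.docs;

for (let i = 0; i < docs.length; i += batchLimit) {
  const batch = db.batch();
  docs.slice(i, i + batchLimit).forEach((doc) => {
    const object = doc.data();
    object.ms_timestamp = momentTz().tz("Asia/Kolkata").valueOf();
    batch.set(db.collection("camp").doc(doc.get("campId")), object, { merge: true });
  });
  await batch.commit(); // each commit stays within the 500-write limit
}
If you prefer, you can also collect the commit promises and await them together with Promise.all instead of committing sequentially.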
I'm building a chat app with different groups. Therefore I'm using a single collection in MongoDB (one for all groups). This is my message schema:
const MessageSchema = mongoose.Schema({
groupId: Number,
userId: Number,
messageIat: Date,
message: String,
reactions: []
});
Let's say I want to load the last 50 messages of the group with the id 10.
To sort the messages I'm using the default ObjectId.
I'm using the following query. To me, it seems like I'm loading all messages of group 10, then sorting them to ensure the order, and only then limiting the results. That doesn't seem very efficient. If there are a lot of messages it will take quite some time, right?
return Message.find({groupId:10}).sort( {_id: -1 }).limit(50)
My first try was to do the limit operation first, but then I cannot rely on the order, so what's the common way to do this?
Is it more common to split it up, i.e. to have a collection per group?
Thanks for helping.
According to the docs:
For queries that include a sort operation without an index, the server
must load all the documents in memory to perform the sort before
returning any results.
So first off, make sure to create an index for whatever field you're going to sort the results by.
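In your case that means a compound index covering both the equality filter and the sort, for example on groupId plus _id. A minimal sketch with Mongoose (the shell variant assumes the collection is named messages):
// Compound index: equality on groupId first, then _id for the (reverse) sort.
MessageSchema.index({ groupId: 1, _id: -1 });

// Or directly in the mongo shell:
// db.messages.createIndex({ groupId: 1, _id: -1 })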
Also,
The sort can sometimes be satisfied by scanning an index in order. If
the query plan uses an index to provide the requested sort order,
MongoDB does not perform an in-memory sorting of the result set
Moreover, according to this page, the following queries are equivalent:
db.bios.find().sort( { name: 1 } ).limit( 5 )
db.bios.find().limit( 5 ).sort( { name: 1 } )
Finally, as long as the indexes fit entirely in memory, you should be fine with your current approach. Otherwise you might want to consider doing some manual partitioning.