How do I optimize working with large datasets in MongoDB - node.js

We have multiple collections of about 10,000 documents each (this number will keep growing) that are generated in node.js and need to be stored/queried/filtered/projected multiple times, for which we have a MongoDB aggregation pipeline. Once certain conditions are met, the documents are regenerated and stored.
Everything worked fine when we had 5,000 documents. We inserted them as an array in a single document and used $unwind in the aggregation pipeline. However, at a certain point the array no longer fit in a single document because it exceeded the 16 MB document size limit. We needed to store everything as individual documents in bulk and add some identifiers so we know which 'collection' they belong to, so we can run the pipeline on those documents only.
Problem: writing the documents, which is necessary before we can query them in a pipeline, is problematically slow. The bulk.execute() call can easily take 10-15 seconds, whereas adding them to an array in node.js and writing the <16 MB document to MongoDB only takes a fraction of a second.
var bulk = col.initializeOrderedBulkOp();
for (var i = 0, l = docs.length; i < l; i++) {
    bulk.insert({
        doc   : docs[i],
        group : group.metadata  // identifies which 'collection' the doc belongs to
    });
}
bulk.execute(bulkOpts, function(err, result) {
    // ...
});
How can we address the bulk writing overhead latency?
Thoughts so far:
A memory-based collection that temporarily handles queries while the data is being written to disk.
Figuring out whether the In-Memory Storage Engine (alert: considered beta and not for production) is worth the MongoDB Enterprise licensing.
Perhaps the WiredTiger storage engine has improvements over MMAPv1 beyond compression and encryption.
Storing a single (array) document anyway, but splitting it into <16 MB chunks.
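One more idea we are considering is to split the write into smaller unordered batches, so the server does not have to apply every insert strictly in sequence. A rough sketch (the batch size and the promise-returning execute() are assumptions about the driver version in use, so this is something to benchmark rather than a drop-in fix):
// Sketch, same `col`, `docs`, `group` and `bulkOpts` as above.
// Unordered ops let the server apply inserts without waiting on the previous one,
// and smaller batches keep each round trip bounded.
var BATCH_SIZE = 1000; // assumption - tune for your documents and hardware
var pending = [];

for (var i = 0; i < docs.length; i += BATCH_SIZE) {
    var bulk = col.initializeUnorderedBulkOp();
    docs.slice(i, i + BATCH_SIZE).forEach(function (d) {
        bulk.insert({
            doc   : d,
            group : group.metadata
        });
    });
    // execute() returns a promise when called without a callback (driver >= 2.0)
    pending.push(bulk.execute(bulkOpts));
}

Promise.all(pending).then(function (results) {
    // all batches written - safe to run the aggregation pipeline now
}).catch(function (err) {
    // handle (partial) write failures here
});
Whether this actually reduces the 10-15 second latency depends on where the time goes (serialization, network, index maintenance, journaling), so each of those is worth profiling before settling on a batch size.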

Related

Firestore Query performance issue on Firebase Cloud Functions

I am facing timeout issues on a firebase https function so I decided to optimize each line of code and realized that a single query is taking about 10 seconds to complete.
let querySnapshot = await admin.firestore()
  .collection("enrollment")
  .get();
The enrollment collection has about 23k documents, totaling approximately 6MB.
To my understanding, since the HTTPS function runs on a stateless Cloud Functions server, it should not suffer from the size of the query result. Both Firestore and Cloud Functions run in the same region (us-central). Yet 10 seconds is a very long time for such a simple query that produces a small snapshot.
An interesting fact is that later in the code I update those 23k documents with a new field using BulkWriter, and it takes less than 3 seconds to run bulkWriter.commit().
Another fact is that the https function is not returning any of the query results to the client, so there shouldn't be any "downloading" time affecting the function performance.
Why on earth does it take 3x longer to read values from a collection than to write to it? I always thought the Firestore architecture was meant for apps with high read rates rather than write rates.
Is there anything you would propose to optimize this?
When you perform the get(), a query is created for all document snapshots and the results are returned. These results are fetched sequentially within a single execution, i.e. the list is returned and parsed sequentially until all documents have been read.
While the data may be small, are there any subcollections? This may add some additional latency as the API fetches and parses subcollections.
Updating the fields with a BulkWriter is over 3x faster because BulkWriter operations are performed in parallel and are queued based on Promises, which allows many more operations per second.
The best way to optimize listing all documents is summarised in this link; Google's recommendation follows the same guideline: use an index for faster queries and use multiple readers that fetch the documents in parallel.
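To illustrate the "multiple readers" recommendation, here is a minimal sketch with the Node.js Admin SDK. The split points and the "status" field are placeholders (how you split depends entirely on how your enrollment document IDs are distributed), and select() only helps if you do not need the whole document:
const admin = require("firebase-admin");
admin.initializeApp();
const db = admin.firestore();

// Hypothetical split points - adjust (or compute) them for your own ID distribution.
const splitPoints = ["8", "g", "s"];
const bounds = [null, ...splitPoints, null];

async function listEnrollmentsInParallel() {
  const reads = [];
  for (let i = 0; i < bounds.length - 1; i++) {
    let q = db.collection("enrollment")
      .orderBy(admin.firestore.FieldPath.documentId())
      .select("status"); // only fetch the fields you need ("status" is a placeholder)
    if (bounds[i]) q = q.startAt(bounds[i]);
    if (bounds[i + 1]) q = q.endBefore(bounds[i + 1]);
    reads.push(q.get()); // all ranges are fetched concurrently
  }
  const snapshots = await Promise.all(reads);
  return snapshots.flatMap(snap => snap.docs);
}
Each ID range is fetched concurrently via Promise.all(), which is roughly the same parallelism the BulkWriter already gives you on the write side.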

ArangoDB - Slow query performance on cluster

I have a query that compares two collections and finds the "missing" documents from one side. Both collections (existing and temp) contain about 250K documents.
FOR existing IN ExistingCollection
  LET matches = (
    FOR temp IN TempCollection
      FILTER temp._key == existing._key
      RETURN true
  )
  FILTER LENGTH(matches) == 0
  RETURN existing
In a single-server environment (DB and Foxx on the same server/container), this runs like lightning, in under 0.5 seconds.
However, when I run this in a cluster (single DB, single Coordinator), even when the DB and Coord are on the same physical host (different containers), I have to add a LIMIT 1000 after the initial FOR existing ... to keep it from timing out! Still, this limited result returns in almost 7 seconds!
Looking at the Execution Plan, I see that there are several REMOTE and GATHER statements after the LET matches ... SubqueryNode. From what I can gather, the problem stems from the separation of the data storage and memory structure used to filter this data.
My question: can this type of operation be done efficiently on a cluster?
I need to detect obsolete (to be deleted) documents, but this is obviously not a feasible solution.
Your query executes one subquery for each document in the existing collection. Each subquery will require many HTTP roundtrips for setup, the actual querying and shutdown.
You can avoid the subqueries with the following query. It loads all document _key values into RAM - but that should be no problem with your rather small collections.
LET existingKeys = (FOR existing IN ExistingCollection RETURN existing._key)
LET tempKeys = (FOR temp IN TempCollection RETURN temp._key)
RETURN MINUS(existingKeys, tempKeys)
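If the goal is then to actually delete the obsolete documents, a hedged sketch using the arangojs driver could look like the following (the driver choice, URL and the single-query REMOVE are assumptions, not part of the original answer; test the behaviour on your cluster first):
const { Database, aql } = require("arangojs");

// URL/credentials are placeholders - point this at your coordinator.
const db = new Database({ url: "http://localhost:8529" });

async function removeObsolete() {
  // Same MINUS idea as above, but the obsolete keys are removed inside the
  // same AQL query, so the ~250K keys never have to travel to the client.
  const cursor = await db.query(aql`
    LET existingKeys = (FOR e IN ExistingCollection RETURN e._key)
    LET tempKeys = (FOR t IN TempCollection RETURN t._key)
    FOR key IN MINUS(existingKeys, tempKeys)
      REMOVE key IN ExistingCollection
      RETURN OLD._key
  `);
  return cursor.all(); // keys of the documents that were deleted
}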

which is faster: views or allDocs with Array.filter?

I was wondering about the performance differences between a dedicated view in CouchDB/PouchDB vs. simply retrieving allDocs and filtering them with Array.prototype.filter afterwards.
Let's say we want to get 5,000 todo docs stored in a database.
// Method 1: get all tasks with a dedicated view "todos"
// in CouchDB
function (doc) {
  if (doc.type == "todo") {
    emit(doc._id);
  }
}
// on Frontend
var tasks = (await db.query('myDesignDoc/todos', { include_docs: true })).rows;
// Method 2: get allDocs, and then filter via Array.filter
var tasks = (await db.allDocs({ include_docs: true })).rows;
tasks = tasks.filter(task => task.doc.type == 'todo');
What's better? What are the pros and cons of each of the 2 methods?
The use of the view will scale better. But which is "faster" will depend on so many factors that you will need to benchmark for your particular case on your hardware, network and data.
For the "all_docs" case, you will effectively be transferring the entire database to the client, so network speed will be a large factor here as the database grows. If you do this as you have, by putting all the documents in an array and then filtering, you're going to hit memory usage limits at some point - you really need to process the results as a stream. This approach is O(N), where N is the number of documents in the database.
For the "view" case, a B-Tree index is used to find the range of matching documents. Only the matching documents are sent to the client, so the savings in network time and memory depend on the proportion of matching documents from all documents. Time complexity is O(log(N) + M) where N is the total number of documents and M is the number of matching documents.
If N is large and M is small then this approach should be favoured. As M approaches N, both approaches are pretty much the same. If M and N are unknown or highly variable, use a view.
You should consider one other thing - do you need the entire document returned? If you need only a few fields from large documents then views can return just those fields, reducing network and memory usage further.
Mango queries may also be of interest instead of views for this sort of query. You can create an index over the "type" field if the dataset size warrants it, but it's not mandatory.
Personally, I'd use a Mango query and add the index if/when necessary.
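To make the Mango suggestion concrete, here is a minimal sketch with PouchDB and the pouchdb-find plugin (the 'title' and 'done' field names are assumptions; the index is optional for small datasets but pays off as the database grows):
const PouchDB = require('pouchdb');
PouchDB.plugin(require('pouchdb-find'));

const db = new PouchDB('tasks');

async function getTodos() {
  // Optional: a Mango index over "type"; without it the selector falls back
  // to scanning all documents, which is fine while the database is small.
  await db.createIndex({ index: { fields: ['type'] } });

  const result = await db.find({
    selector: { type: 'todo' },
    fields: ['_id', 'title', 'done'] // 'title' and 'done' are assumed field names
  });
  return result.docs;
}
The fields option mirrors the earlier point about views: only the listed fields travel over the wire, which keeps network and memory usage down.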

Mongoose insertMany limit

Mongoose 4.4 now has an insertMany function which lets you validate an array of documents and, if they are valid, insert them all with one operation rather than one per document:
var arr = [{ name: 'Star Wars' }, { name: 'The Empire Strikes Back' }];
Movies.insertMany(arr, function(error, docs) {});
If I have a very large array, should I batch these? Or is there no limit on the size of the array?
For example, I want to create a new document for every Movie, and I have 10,000 movies.
Based on personal experience, I'd recommend batches of 100-200; that gives a good blend of performance without putting strain on your system.
An insertMany group of operations can have at most 1,000 operations. If a group exceeds this limit, MongoDB will divide the group into smaller groups of 1,000 or less. For example, if the queue consists of 2,000 operations, MongoDB creates 2 groups, each with 1,000 operations.
The sizes and grouping mechanics are internal performance details and are subject to change in future versions.
Executing an ordered list of operations on a sharded collection will generally be slower than executing an unordered list since with an ordered list, each operation must wait for the previous operation to finish.
Mongo 3.6 update:
The limit for insertMany() has increased in Mongo 3.6 from 1000 to 100,000.
Meaning that now, groups above 100,000 operations will be divided into smaller groups accordingly.
For example: a queue that has 200,000 operations will be split and Mongo will create 2 groups of 100,000 each.
The method takes that array and inserts it through MongoDB's insertMany, so the practical size of the array mainly depends on how much your machine can handle.
Note, however, one more point (not a limitation, but something worth keeping in mind) about how MongoDB deals with multiple operations: by default it handles a batch of 1,000 operations at a time and splits whatever exceeds that.
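To put the batching advice into code, here is a rough sketch; the batch size of 200 comes from the answer above and is a starting point to benchmark, not a hard rule, and the Movies model is the one from the question:
const mongoose = require('mongoose');

const Movies = mongoose.model('Movies', new mongoose.Schema({ name: String }));

// Insert a large array in chunks so no single insertMany call gets too big.
async function insertInBatches(docs, batchSize = 200) {
  for (let i = 0; i < docs.length; i += batchSize) {
    const chunk = docs.slice(i, i + batchSize);
    // ordered: false keeps inserting the rest of the chunk even if one
    // document fails validation or insertion.
    await Movies.insertMany(chunk, { ordered: false });
  }
}
With ordered: false, a single invalid document does not abort the rest of its chunk, which usually matters when inserting thousands of generated documents.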

ArangoDB - Performance issue with AQL query

I'm using ArangoDB for a Web Application through Strongloop.
I've got a performance problem when I run this query:
FOR result IN Collection SORT result.field ASC RETURN result
I added a skiplist index on the sorted field to speed up the query.
My collection contains more than 1M records.
The application is hosted on n1-highmem-2 on Google Cloud.
Here are some specs:
2 CPUs - Xeon E5 2.3 GHz
13 GB of RAM
10 GB SSD
Unfortunately, my query takes a very long time to finish.
What can I do?
Best regards,
Carmelo
Summarizing the discussion above:
If there is a skiplist index present on the field attribute, it could be used for the sort. However, if it was created as sparse, it can't be. This can be verified by running
db.Collection.getIndexes();
in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sorting will be required - which can be revalidated using Explain.
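For example, the explain output can be checked directly from the ArangoShell (collection and attribute names taken from the question); if the index is used, the plan should show an index node and no SortNode:
db._explain("FOR result IN Collection SORT result.field ASC RETURN result");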
However, the query will still build a huge result in memory which will take time and consume RAM.
If a large result set is desired, LIMIT can be used to retrieve slices of the results in several chunks, which will cause less stress on the machine.
For example, first iteration:
FOR result IN Collection SORT result.field LIMIT 10000 RETURN result
Then process these first 10,000 documents offline, and note the result value of the last processed document.
Now run the query again, but now with an additional FILTER:
FOR result IN Collection
  FILTER result.field > @lastValue
  SORT result.field
  LIMIT 10000 RETURN result
until there are no more documents. That should work fine if result.field is unique.
If result.field is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method will be at least an approximation.
Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already.
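As a rough illustration of that chunked iteration from node.js, here is a sketch using the arangojs driver (the driver choice, URL and the uniqueness of result.field are assumptions):
const { Database, aql } = require("arangojs");

// URL/credentials are placeholders - adjust for your deployment.
const db = new Database({ url: "http://localhost:8529" });

// Walk the collection in sorted chunks of 10,000, using the last seen
// value of `field` as the cursor (assumes `field` is unique, as noted above).
async function processAllSorted(handleChunk) {
  let lastValue = null;
  for (;;) {
    const query = lastValue === null
      ? aql`FOR result IN Collection SORT result.field LIMIT 10000 RETURN result`
      : aql`FOR result IN Collection FILTER result.field > ${lastValue} SORT result.field LIMIT 10000 RETURN result`;
    const cursor = await db.query(query);
    const chunk = await cursor.all();
    if (chunk.length === 0) break;        // no more documents
    await handleChunk(chunk);             // process this slice "offline"
    lastValue = chunk[chunk.length - 1].field;
  }
}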