With bookshelf.js it is easy enough to fetch all records for a given model using Model.fetchAll and then loop through them, like so:
SomeModel.fetchAll().then(function(results) {
  results.models.forEach(function(model) {
    ...
  });
});
But this loads the entire result set all at once, which is impractical for very large result sets. Is there a simple way to load the results in smaller batches (e.g. only 1000 at a time)?
I know I could roll my own by maintaining an offset counter and using limit() and offset(), but really I'm looking for something that hides the nuts and bolts, analogous to ActiveRecord's find_in_batches.
But I can't tell from the docs or a Google search whether a batched fetcher method even exists. Is there a simple way to do this?
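For context, rolling my own would look something like this (the helper name, batch size, and the 'id' ordering column are just placeholders, not a bookshelf API):

// Hypothetical helper, not part of bookshelf.js: fetches rows in fixed-size
// batches via knex's limit/offset, ordered by a stable column ('id' here).
async function fetchInBatches(Model, batchSize, handler) {
  let offset = 0;
  while (true) {
    const batch = await Model.forge()
      .query(qb => qb.orderBy('id').limit(batchSize).offset(offset))
      .fetchAll();
    batch.models.forEach(handler);
    if (batch.length < batchSize) break; // last (possibly partial) batch
    offset += batchSize;
  }
}

// Usage with the model from the question (returns a promise):
fetchInBatches(SomeModel, 1000, function(model) {
  // ... process each model
});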
I'm at a crossroads trying to decide what methodology to use. Basically, I have a MongoDB collection and I want to query it with specific params provided by the user, then group the response according to the value of some of those parameters. For example, let's say my collection is animals; if I query all animals I get something like this:
[
  {type: "Dog", age: 3, name: "Kahla"},
  {type: "Cat", age: 6, name: "mimi"},
  ...
]
Now I would like to return to the user a response that is grouped by the animal type, so that I end up with something like
{
Dogs: [...dog docs],
Cats: [...cat docs],
Cows: [...],
}
So basically I have 2 ways of doing this. One is to just use Model.find() and fetch all the animals that match my specific queries, such as age or any other field, and then manually filter and format my JSON response before sending it back to the user with res.json({}) (I'm using Express, by the way).
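Roughly, that first approach would look something like this (assuming a mongoose Animal model and an Express route; the path and query param are just placeholders):

// Option 1: fetch the matching animals, then group them in JavaScript.
app.get('/animals', async (req, res) => {
  const filter = {};
  if (req.query.age) filter.age = Number(req.query.age); // placeholder user param

  const animals = await Animal.find(filter).lean();

  // Group the flat array by type, e.g. { Dog: [...], Cat: [...] }
  const grouped = animals.reduce((acc, animal) => {
    (acc[animal.type] = acc[animal.type] || []).push(animal);
    return acc;
  }, {});

  res.json(grouped);
});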
Or I can use Mongo's aggregation framework and $group to do this at the query level, so the DB returns an already grouped response. The only inconvenience I've found with this so far is how the response is formatted; it ends up looking more like this:
[
  {
    "_id": "Dog",
    "docs": [{dog docs...}]
  },
  {
    "_id": "Cat",
    "docs": [{...}]
  }
]
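For reference, the aggregation I have in mind looks roughly like this (again assuming a mongoose Animal model; the $match stage stands in for whatever filters the user supplies):

// Option 2: let MongoDB do the grouping.
const grouped = await Animal.aggregate([
  { $match: { age: { $gte: 3 } } },                        // placeholder filter
  { $group: { _id: '$type', docs: { $push: '$$ROOT' } } }  // one bucket per type
]);

// The [{ _id, docs }] array could still be reshaped into a keyed object afterwards:
const byType = Object.fromEntries(grouped.map(g => [g._id, g.docs]));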
The overall result is BASICALLY the same, but the formatting of the response is quite different, and my front-end client needs to adjust to however I send the response. I don't really like the array of objects from the aggregation; I prefer a JSON-like object response with key names corresponding to the arrays, as I see fit.
So the real question here is whether there is a significant advantage to one way over the other. Is the aggregation framework fast enough that it will scale well if my collection grows to huge numbers? Is filtering through the data with JavaScript and mapping the response into the shape I want so inefficient that it's better to use aggregation and adapt the front end to that response shape?
I'm assuming that by "faster" you mean the least time to serve a request. With that in mind, let's divide the time required to process your request into:
Asynchronous Operations (Network Operations, File read/write etc)
Synchronous Operations
Synchronous operations are usually much faster than asynchronous ones (though this also depends on the nature of the operation and the amount of data being processed). For example, if you loop over an iterable (e.g. an Array or Map) with fewer than 1000 elements, it won't take more than a few milliseconds.
On the other hand, asynchronous operations take more time. For example, an HTTP request takes at least a couple of milliseconds to get a response.
When you query MongoDB with mongoose, it's an asynchronous call and will take more time. So the more queries you run against the database, the slower your API becomes. MongoDB aggregation can help you reduce the total number of queries, which may make your API faster. The catch is that aggregations are usually slower than plain find requests.
In summary: if you can filter and group the data manually without adding any extra DB queries, that is going to be faster.
I currently have a table in my Postgres database with about 115k rows that I feel is too slow for my serverless functions. The only thing I need that table for is to look up values using operators like ILIKE, and I believe the network round trip is slowing things down a lot.
My thought was to turn the table into a JavaScript array of objects, since it rarely, if ever, changes. I now have it in a file such as array.ts, containing:
export default [
{}, {}, {},...
]
What is the best way to query this huge array? Is it best to just use the .filter function? I am currently trying to import the array and filter it, but it seems to just hang and never complete. It is MUCH slower than the current DB approach, so I am unsure whether this is the right approach.
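Concretely, the lookup I'm trying to replicate is just a case-insensitive substring match, something like this (the name field stands in for whichever column I'd actually search on):

// rows is the array exported from array.ts, already loaded into memory.
// 'name' is a placeholder for whichever field the ILIKE search targets.
function search(rows, term) {
  const needle = term.toLowerCase();
  return rows.filter(row => String(row.name).toLowerCase().includes(needle));
}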
Make the database faster
As people have commented, it's likely that the database will actually perform better than anything else given that databases are good at indexing large data sets. It may just be a case of adding the right index, or changing the way your serverless functions handle the connection pool.
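For example, if the slow part is an ILIKE '%term%' search, a trigram index is one option. A one-off setup script using the pg package might look like this (table and column names are placeholders for your own schema):

// One-off setup, e.g. run from a migration or a small Node script.
const { Client } = require('pg');

async function addTrigramIndex() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  // pg_trgm lets Postgres serve ILIKE '%term%' searches from a GIN index.
  await client.query('CREATE EXTENSION IF NOT EXISTS pg_trgm');
  await client.query(
    'CREATE INDEX IF NOT EXISTS items_name_trgm_idx ON items USING gin (name gin_trgm_ops)'
  );
  await client.end();
}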
Make local files faster
If you want to do it without the database, there are a couple of things that will make a big difference:
Read the file and then use JSON.parse, do not use require(...)
JavaScript is much slower to parse than JSON. You can therefore make things load much faster by storing the data as JSON and parsing that, rather than requiring it as a JavaScript module.
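For example (assuming the data is exported once to a plain data.json file instead of array.ts):

const fs = require('fs');

// Parsing JSON text is much cheaper for the engine than parsing the
// equivalent JavaScript/TypeScript module.
const items = JSON.parse(fs.readFileSync('./data.json', 'utf8'));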
Find a way to split up the data
Especially in a serverless environment, you're unlikely to need all the data for every request, and the serverless function will probably only serve a few requests before it is shut down and a new one is started.
If you could split your files up such that you typically only need to load an array of 1,000 or so items, things will run much faster.
Depending on the size of the objects, you might consider having a file that contains only the ids of the objects and the fields needed to filter them, then a separate file for each object so you can load the full object only after you have filtered.
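A sketch of that layout, with hypothetical file names and fields:

const fs = require('fs');

// index.json holds only what's needed to filter: [{ id, name }, ...]
const index = JSON.parse(fs.readFileSync('./data/index.json', 'utf8'));

function findByName(term) {
  const needle = term.toLowerCase();
  return index
    .filter(entry => entry.name.toLowerCase().includes(needle))
    // Load the full record only for the rows that matched.
    .map(entry => JSON.parse(fs.readFileSync(`./data/items/${entry.id}.json`, 'utf8')));
}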
Use a local database
If the issue is genuinely the network latency, and you can't find a good way to split up the files, you could try using a local database engine.
@databases/sqlite can be used to query an SQLite database file that you could pre-populate with your array of values and index appropriately.
const openDatabase = require('@databases/sqlite');
const {sql} = require('@databases/sqlite');

const db = openDatabase('mydata.db');

async function query(pattern) {
  // Return the matching rows rather than discarding them.
  return db.query(sql`SELECT * FROM items WHERE item_name LIKE ${pattern}`);
}
query('%foo%').then(results => console.log(results));
I've got a long-running pipeline that has some failing items (items that at the end of the process are not loaded because they fail database validation or something similar).
I want to rerun the pipeline, but only process the items that failed the import on the last run.
I have a system in place where I check each item ID (that I receive from the external source). I do this check in my loader: if I already have that item ID in the database, I skip loading/inserting that item into the database.
This works great. However, it's slow, since I do extract-transform-load for each of these items, and only then, on load, I query the database (one query per item) and compare item IDs.
I'd like to filter out these records sooner. If I do it in the transformer, I can again only do it per item. It looks like the extractor could be the place, or I could pass records to the transformer in batches and then filter + explode the items in the (first) transformer.
What would be better approach here?
I'm also thinking about reusability of my extractor, but I guess I could live with the fact that one extractor does both extract and filter. I think the best solution would be to be able to chain multiple extractors. Then I'd have one that extracts the data and another one that filters the data.
EDIT: Maybe I could do something like this:
already_imported_item_ids = Items.pluck(:item_id)

Kiba.run(
  Kiba.parse do
    source(...)
    transform do |item|
      next if already_imported_item_ids.include?(item[:item_id])
      item
    end
    transform(...)
    destination(...)
  end
)
I guess that could work?
A few hints:
The higher (sooner) in the pipeline, the better. If you can find a way to filter out right from the source, the cost will be lower, because you do not have to manipulate the data at all.
If you have a scale small enough, you could load only the full list of ids at the start in a pre_process block (mostly what you have in mind in your code sample), then compare right after the source. Obviously it doesn't scale infinitely, but it can work a long time depending on your dataset size.
If you need to operate at a larger scale, I would advise either working with a buffering transform (grouping N rows) that issues a single SQL query to verify the existence of all N row ids in the target database, or working with groups of rows and then exploding them, as you suggested.
I am trying to get the total count of some documents using a mongoose count query. When I try to count around 85k documents, it takes 12 seconds. I need to reduce that to 2 or 3 seconds.
This is just an example; there could be several hundred thousand documents that have to be counted, which I think will take far too long.
Here is the query which I am using to count documents
Donor.count(find_cond, function (er, doc) {
console.log(doc, "doc")
});
When it counts 10k to 20k documents it's fine, but beyond that it becomes far too time-consuming, and it shouldn't be.
I know it is a little late, but I'll write this for future reference. After looking around for a while, the best way I found to count documents is estimatedDocumentCount(), which uses collection metadata.
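For example:

// Fast because it reads collection metadata, but note that it counts all
// documents in the collection and ignores any filter conditions.
const total = await Donor.estimatedDocumentCount();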
Another way, for a very large collection (over 200k documents), is the Model.collection.stats() method, which returns an object with a "count" key, as in this example:
const stats = await User.collection.stats();
const userCount = stats.count
It's still not great but the performance is much much better than countDocuments().
Can you try something like this?
Donor.collection.createIndex({ field1: 1, field2: 1, field3: 1 });
Donor.find({"field1" : "val1", "field2" : "val2"}).sort({field3: -1}).limit(100000).lean().count().exec();
An index is used for fast retrieval of data from the database.
Performance can be improved by following the optimal equality -> sort -> range order in the compound index.
Also, objects returned when using lean() are plain JavaScript objects, whereas a normal query returns full Mongoose documents.
This article provides useful guidelines for mongodb performance improvement.
Use an index on the field(s) you are trying to count.
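For example, if find_cond filters on a status field (both the schema variable and the field name below are placeholders), an index on that field lets the count be served from the index rather than by scanning documents:

// In the schema definition; 'status' stands in for whichever field(s)
// actually appear in find_cond.
donorSchema.index({ status: 1 });

// Count only the documents matching the condition.
const matching = await Donor.countDocuments({ status: 'active' });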
I have tried searching for similar answers and applied their solutions, but they don't seem to work in my case. I am querying a mongoose collection that contains 60k documents, and I need all 60k to apply combinatorics, so I can't apply limits. I could reduce the data volume by querying multiple times based on a certain property, but that would be costly in terms of performance as well. I don't see what else to try. Can someone help me out?
I am using this simple code for now:
StagingData.find({})
  .lean()
  .exec(function(err, results) {
    console.log(results); // I don't get any output
  });
When I use:
let data = await StagingData.find({}).lean() //it takes forever
What should I do?
You might want to apply indexing first, e.g. precomputing some values as a separate operation, parallel processing, etc. For this you may want to jump to a different technology, maybe Elasticsearch or Spark, depending on your code.
You may also want to identify the bottleneck in your process: memory or processor. Try experimenting with a smaller set of documents and see how quickly you get results; from this you might be able to infer how long it will take for the whole dataset.
You may also try breaking your operation down into smaller chunks and identifying the cost of processing each one.
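If you do end up processing the 60k documents in smaller chunks, a mongoose query cursor is one way to stream them without materializing a single huge array (the batch size and processBatch function below are placeholders):

async function processAll() {
  // Stream documents instead of building one 60k-element array in memory.
  const cursor = StagingData.find({}).lean().cursor();

  let batch = [];
  for await (const doc of cursor) {
    batch.push(doc);
    if (batch.length === 1000) {
      await processBatch(batch); // placeholder for your combinatorics step
      batch = [];
    }
  }
  if (batch.length > 0) await processBatch(batch);
}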