Range-based, chronological pagination queries across multiple collections with MongoDB? - node.js

Is there an efficient way to do a range-based query across multiple collections, sorted by an index on timestamps? I basically need to pull in the latest 30 documents from 3 collections and the obvious way would be to query each of the collections for the latest 30 docs and then filter and merge the result. However that's somewhat inefficient.
Even if I were to select only the timestamp field in the first query and then do a second batch of queries for the latest 30 docs, I'm not sure that would be a better approach. That would still be 90 documents (whole or single-field) per pagination request.
Essentially the client can be subscribed to articles, and each category of article differs by 0 - 2 fields. I just picked 3 since that is the average number of article categories users are subscribed to so far in the beta. Because of the possible field differences, I didn't think it made sense to put all of the different article types in a single collection.

MongoDB operations operate on one and only one collection at a time. Thus you need to structure your schema with collections that match your query needs.
Option A: Get Ids from supporting collection, load full docs, sort in memory
So you could maintain a supporting collection that combines the ids, main collection names, and timestamps of the 3 collections. Query that to get your 30 id/collection pairs, then load the corresponding full documents with 3 additional queries (1 to each main collection). Of course those won't come back in the correct combined order, so you need to sort that page of results in memory before returning it to your client. A document in the supporting collection would look something like:
{
  _id: ObjectId,   // id of the document in its main collection
  updated: Date,
  type: String     // name of the main collection the document lives in
}
This way allows mongo to do the pagination for you.
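For concreteness, here's a rough sketch of Option A with the Node.js driver. The collection name article_index and the idea that the type field stores the main collection name are assumptions for illustration:

async function getPage(db, pageSize = 30, skip = 0) {
  // 1) Page through the small supporting collection (assumed name: article_index).
  const refs = await db.collection('article_index')
    .find({}, { projection: { _id: 1, updated: 1, type: 1 } })
    .sort({ updated: -1 })
    .skip(skip)
    .limit(pageSize)
    .toArray();

  // 2) Group the ids by the main collection they belong to.
  const idsByType = {};
  for (const ref of refs) {
    (idsByType[ref.type] = idsByType[ref.type] || []).push(ref._id);
  }

  // 3) Hydrate the full documents, one $in query per main collection, in parallel.
  const batches = await Promise.all(
    Object.entries(idsByType).map(([type, ids]) =>
      db.collection(type).find({ _id: { $in: ids } }).toArray()
    )
  );

  // 4) The per-collection results come back unordered, so restore the combined order.
  return batches.flat().sort((a, b) => b.updated - a.updated);
}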
Option B: 3 Queries, Union, Sort, Limit
Or, as you said, load 30 documents from each collection, sort the union set in memory, drop the extra 60, and return the combined result. This avoids the overhead of maintaining and synchronizing an extra collection.
So I would say your current approach (Option B, as I call it) is the lesser of those two not-so-great options.
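And a sketch of Option B under similar assumptions (three collections named articlesA/articlesB/articlesC, each with an updated field):

async function latestAcrossCollections(db, pageSize = 30) {
  // Pull the newest 30 from each collection in parallel.
  const names = ['articlesA', 'articlesB', 'articlesC'];
  const perCollection = await Promise.all(
    names.map(name =>
      db.collection(name).find().sort({ updated: -1 }).limit(pageSize).toArray()
    )
  );
  // Merge in memory, keep the newest 30, drop the extra 60.
  return perCollection
    .flat()
    .sort((a, b) => b.updated - a.updated)
    .slice(0, pageSize);
}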

If your query is really to get the most recent articles based on a selection of categories, then I'd suggest you:
A) Store all of the documents in a single collection so a single query can fetch a combined, paged result (see the sketch after this list). Unless you have a very consistent date range across collections, you'll need to track date ranges and counts so that you can reasonably fetch a set of documents that can be merged: the 30 newest from one collection may all be older than everything from another. You can add an index on timestamp and category and then limit the results.
B) Cache everything aggressively so that you rarely need to do the merges
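A minimal sketch of suggestion A, assuming a single articles collection with category and updated fields and a subscribedCategories array coming from the client:

// One-time: compound index so "latest N for these categories" is served from the index.
await db.collection('articles').createIndex({ category: 1, updated: -1 });

// Per request: one query, one sort, one limit.
const page = await db.collection('articles')
  .find({ category: { $in: subscribedCategories } })
  .sort({ updated: -1 })
  .limit(30)
  .toArray();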

You could use the same idea I explained here; although that post is about MongoDB text search, it applies to any kind of query:
MongoDB Index optimization when using text-search in the aggregation framework
The idea is to query each of your collections ordered by date and id, then sort/merge the results to return the first page. Subsequent pages are retrieved by using the date and id of the last document from the previous page.
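A rough sketch of that keyset-style query against one collection, assuming an updated field; the same filter would be applied to each collection before merging the pages:

// lastUpdated / lastId come from the final document of the previous page
// (pass nothing for the first page).
function nextPage(collection, lastUpdated, lastId, pageSize = 30) {
  const filter = lastUpdated
    ? {
        $or: [
          { updated: { $lt: lastUpdated } },
          { updated: lastUpdated, _id: { $lt: lastId } },
        ],
      }
    : {};
  return collection
    .find(filter)
    .sort({ updated: -1, _id: -1 })
    .limit(pageSize)
    .toArray();
}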

Related

Retrieve data from three different unrelated collections in a single query

Question:
I'm using the Node MongoDB driver. I'm trying to determine whether I should write a single query that gets data from three collections, or whether the database needs one collection with references or embedded documents etc… that joins these three unrelated collections.
Use case:
During search I get an array of objects and take the first 10 from the array; each object is metadata about a document belonging to one of the three collections. The collections are unrelated but have some common fields, and this metadata is the only way to go and get information at later stages.
For example, during search I get and store this array in React state (see example object below); then, when the user clicks on a search result, I have to loop over this array to grab the relevant metadata so I can retrieve more content…
Example Object inside Array of Objects (Meta data):
[{
collection: 'pmc_test',
id_field: 'id_int',
id_type: 'int',
id_value: 2657156
},
{
collection: 'arxiv',
id_field: 'id_int',
id_type: 'int',
id_value: 2651582
},
{
collection: 'crossref',
id_field: 'DOI',
id_type: 'string',
id_value: "10.1098/rsbm.1955.0005"
},
...] // different collections, usually passed with 10 objects
However, to display the 10 search results in the first place I have to loop over each object in the array and build and run a query, which could result in 10 separate queries. I can at least minimise this by doing 3 queries using the $in operator, providing three arrays of IDs, one per collection.
This is still multiple queries: I have to go to the 1st collection, then the 2nd, then the 3rd, and then combine all the results to display the search results. This is what I'm trying to avoid. This is roughly how each of the three collections looks.
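For concreteness, a rough sketch of that three-query $in approach with the Node driver; only the collection and id field names come from the metadata example above, the helper itself is hypothetical:

// "results" is the metadata array held in React state.
async function fetchSearchResults(db, results) {
  // Group id values by collection and by the id field that collection uses.
  const groups = {};
  for (const meta of results) {
    const key = `${meta.collection}|${meta.id_field}`;
    (groups[key] = groups[key] || []).push(meta.id_value);
  }

  // One $in query per collection instead of ten individual queries, run in parallel.
  const batches = await Promise.all(
    Object.entries(groups).map(([key, ids]) => {
      const [collection, idField] = key.split('|');
      return db.collection(collection).find({ [idField]: { $in: ids } }).toArray();
    })
  );
  return batches.flat();
}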
Any suggestions on what querying approach i could use? Will the database benefit from having a single collection / approach that will avoid having to use the meta data to look in three different collections?
Currently this is a massive breaking change to the application, resulting in at least 15 features / API calls needing updates; I'd like to maintain the ability to query one collection and suggest this as an optimal change.
Thanks in advance.
Edit
Example collections here:
Arxiv collection: https://gist.github.com/Natedeploys/6734dffccea7b293ca16b5bd7c73a6b6
Crossref collection:
https://gist.github.com/Natedeploys/9b0d3b02c665d7507ed75c9d5fbff159
Pubmed collection (pmc_test):
https://gist.github.com/Natedeploys/09527e8ceaf5d3f0f70ba28984b87a73
You can do all these operations with MongoDB aggregation; in your case the $lookup and $group stages will be applicable. To go further, please share the JSON data of one document from each collection, so it is easier to guide you.

Mongoose: how to use index in aggregate?

How can I use indexes in aggregate?
I saw the documentation: https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes
The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.
Is there any way to use an index when these stages do not occur at the beginning of the pipeline, e.g. $sort, $match or $group?
Please help me
An index works by keeping a record of certain pieces of data that point to a given record in your collection. Think of it like having a novel, and then having a sheet of paper that lists the names of various people or locations in that novel with the page numbers where they're mentioned.
Aggregation is like taking that novel and transforming the different pages into an entirely different stream of information. You don't know where the new information is located until the transformation actually happens, so you can't possibly have an index on that transformed information.
In other words, it's impossible to use an index in any aggregation pipeline stage that is not at the very beginning because that data will have been transformed and MongoDB has no way of knowing if it's even possible to efficiently make use of the newly transformed data.
If your aggregation pipeline is too large to handle efficiently, then you need to limit the size of your pipeline in some way such that you can handle it more efficiently. Ideally this would mean having a $match stage that sufficiently limits the documents to a reasonably-sized subset. This isn't always possible, however, so additional effort may be required.
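For illustration, a short Mongoose sketch (the Transaction model, the field names, and the { userId: 1, createdAt: -1 } index are assumptions): put the indexed $match and $sort at the front so the planner can use the index, and let the heavier stages run on the reduced set.

const results = await Transaction.aggregate([
  { $match: { userId: someUserId, createdAt: { $gte: since } } }, // can use the index
  { $sort: { createdAt: -1 } },                                   // can also use the index
  { $group: { _id: '$type', total: { $sum: '$amount' } } },       // runs on the reduced set only
]);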
One possibility is generating "summary" documents that are the result of aggregating all new data together, then performing your primary aggregation pipeline using only these summary documents. For example, if you have a log of transactions in your system that you wish to aggregate, then you could generate a daily summary of the quantities and types of the different transactions that have been logged for the day, along with any other additional data you would need. You would then limit your aggregation pipeline to only these daily summary documents and avoid using the normal transaction documents.
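As a rough illustration of that summary idea (collection and field names are assumptions), a daily rollup could be written with $group and $merge (MongoDB 4.2+):

// Roll one day's transactions up into a summary collection.
await db.collection('transactions').aggregate([
  { $match: { createdAt: { $gte: startOfDay, $lt: endOfDay } } },
  { $group: {
      _id: { day: { $dateToString: { format: '%Y-%m-%d', date: '$createdAt' } }, type: '$type' },
      count: { $sum: 1 },
      amount: { $sum: '$amount' },
  } },
  { $merge: { into: 'daily_transaction_summaries' } },
]).toArray();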
An actual solution is beyond the scope of this question, however. Just be aware that the index usage is a limitation that you cannot avoid.

Can you find a specific documents position in a sorted Azure Search index

We have several Azure Search indexes that use a Cosmos DB collection of 25K documents as a source and each index has a large number of document properties that can be used for sorting and filtering.
We have a requirement to allow users to sort and filter the documents and then search and jump to a specific documents page in the paginated result set.
Is it possible to query an Azure Search index with sorting and filtering and get the position/rank of a specific document id from the result set? Would I need to look at an alternative option? I believe there could be a way of doing this with a SQL back-end but obviously that would be a major undertaking to implement.
I've yet to find a way of doing this other than writing a query to paginate through until I find the required document, which would be a relatively expensive and possibly slow task in terms of processing on the server.
There is no mechanism in Azure Search for filtering within the resultset of another query. You'd have to page through results, looking for the document ID on the client side. If your queries aren't very selective and produce many pages of results, this can be slow as $skip actually re-evaluates all results up to the page you specify.
You could use caching to make this faster. At least one Azure Search customer is using Redis to cache search results. If your queries are selective enough, you could even cache the results in memory so you'd only pay the cost of paging once.
Trying this at the moment. I'm using a two step process:
Generate your query but set $count=true and $top=0. The query result should contain a field named @odata.count.
You can then pick a position in the results and use $top=1 and $skip=<position> to return a single entry. There is one caveat: $skip will only accept numbers less than 100000.
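A rough sketch of those two steps against the Azure Search REST docs endpoint, using Node's built-in fetch; the service, index, key, api-version and the title sort field are placeholders:

const base = 'https://<service>.search.windows.net/indexes/<index>/docs';
const headers = { 'api-key': '<query-key>' };
const common = `api-version=2020-06-30&search=*&$orderby=${encodeURIComponent('title asc')}`;

// Step 1: total count only ($top=0 returns no documents).
const countRes = await fetch(`${base}?${common}&$count=true&$top=0`, { headers });
const total = (await countRes.json())['@odata.count'];

// Step 2: fetch the single entry at a chosen position (remember: $skip stays below 100000).
const position = 42;
const pageRes = await fetch(`${base}?${common}&$top=1&$skip=${position}`, { headers });
const doc = (await pageRes.json()).value[0];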

How to find all documents in a database created by many different users?

I need to find all documents in a database which have been created by x number of users and the result must be a combined (sorted by date) list/collection of documents from all these users.
So I have a multi-value field that contains e.g. 100 users, and I now need to programmatically return a collection of all the documents in the database that have been created by those users.
Worth mentioning here is that the list of 100 users is dynamic, so in another document there might be 100 different users to be used for the search.
I have experimented with the following kind of query, but I believe I run into some kind of limit in the search query (it looks like if the query is too long I get "query is not understandable"):
FIELD CreatedBy Contains "Thomas" OR FIELD CreatedBy Contains "Peter".... up to 100 more like this
Also, finding these documents is triggered by web users, so it must be relatively fast.
Is there another way to find these documents?
Thanks
Thomas
Your best bet might be a view sorted by the creator and a loop with 100 getAllDocumentsByKey calls. Alternatively, you could sort the entry field and just walk through the ViewNavigator after jumping to the first document, running a comparison as you go.
The results could be added to a folder (you need to manage folders for different query sets and have a cleanup routine) that is sorted how you need it, or you sort in memory (think JavaBean). The folder option makes paging through the results easier; you probably don't want to show 10,000 results in one go, as that would take very long to transmit. You could also use a "result document" where you store the result as JSON and use that as the paging source.
Easily done with Ytria's ScanEZ. They have trials, I think

SOLR - How to have facet counts restricted to rows returned in resultset

/select/?q=*:*&rows=100&facet=on&facet.field=category
I have around 100 000 documents indexed, but I return only 100 documents using rows=100. The facet counts returned for category, however, are the counts across all indexed documents.
Can we somehow restrict the facets to the result set returned? i.e 100 rows only?
I don't think it is possible in any direct manner, as was pointed out by Pascal.
I can see two ways to achieve this:
Method I: do the counting yourself by visiting the 100 results returned. This is very easy and fast if they are categorical fields, but harder if they are text fields that need to be tokenized, etc.
Method II: do two passes:
Do a normal query without facets (you only need to request doc ids at this point)
Collect all the IDs of the documents returned
Do a second query for all fields and facets, adding a filter to restrict results to the IDs collected in step 2. Something like:
select/?q=*:*&facet=on&facet.field=category&fq=id:(312 OR 28 OR 1231 ...)
The first is far more efficient, and I would recommend it for non-textual fields. The second is computationally more expensive but has the advantage of working for all types of fields.
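For illustration, Method II as two HTTP requests from Node; the host, core name and field names are placeholders, and the ids in this example are numeric so they need no extra quoting in the fq:

const solr = 'http://localhost:8983/solr/mycore/select';

// Pass 1: the normal query, ids only.
const first = await fetch(`${solr}?q=*:*&rows=100&fl=id&wt=json`);
const ids = (await first.json()).response.docs.map(d => d.id);

// Pass 2: the same query restricted to those ids, this time with facets on.
const fq = encodeURIComponent(`id:(${ids.join(' OR ')})`);
const second = await fetch(
  `${solr}?q=*:*&rows=100&facet=on&facet.field=category&fq=${fq}&wt=json`
);
const { response, facet_counts } = await second.json();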
Sorry, but I don't think it is possible. The facets are always based on all the documents matching the query.
Not a real answer but maybe better than nothing: the results grouping feature (check out from trunk!):
http://wiki.apache.org/solr/FieldCollapsing
where facet.field=category is then similar to group.field=category, and you will get only as many groups ('facet hits') as you specified!
If you always execute the same query (q=*:*), maybe you can use facet.limit, for example:
select/?q=*:*&rows=100&facet=on&facet.field=category&facet.limit=100
Tell us if the order that Solr uses in the facets is the same as in the query results.
