How to find all documents in a database created by many different users?

I need to find all documents in a database which have been created by x number of users, and the result must be a combined (sorted by date) list/collection of documents from all these users.
So I have a multi-value field that contains e.g. 100 users, and I now need to programmatically return a collection with all the documents in the database that have been created by those users.
Worth mentioning here is that the list of 100 users is dynamic, so another document might hold 100 different users that are to be used for the search.
I have experimented with the following kind of query, but I believe I run into some kind of limit in the search query (it looks like if the query is too long I get "query is not understandable"):
FIELD CreatedBy Contains "Thomas" OR FIELD CreatedBy Contains "Peter" ... up to 100 more like this
Also, finding these documents is triggered by web users, so it must be relatively fast.
Is there another way to find these documents?
Thanks
Thomas

Your best bet might be a view sorted by the creator and a loop with 100 getAllDocumentsByKey calls. For comparison, you could also sort the list of names and just walk through a ViewNavigator after jumping to the first matching document.
The result could be added to a folder (you need to manage folders for different query sets and have a cleanup routine) that is sorted the way you need, or you sort in memory (think JavaBean). The folder option makes paging through the result easier - you probably don't want to show 10,000 results in one go, as that would take very long to transmit. You could also use a "result document" where you store the result as JSON and use it as the paging source.
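A rough Java agent sketch of that loop could look like this. The view name "ByCreator" (with CreatedBy as its first sorted column) and the hard-coded user list are assumptions; in practice the names would come from the multi-value field, and recycle() calls are omitted for brevity:

import lotus.domino.AgentBase;
import lotus.domino.Database;
import lotus.domino.Document;
import lotus.domino.DocumentCollection;
import lotus.domino.NotesException;
import lotus.domino.Session;
import lotus.domino.View;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class CollectByCreators extends AgentBase {
    public void NotesMain() {
        try {
            Session session = getSession();
            Database db = session.getAgentContext().getCurrentDatabase();
            View byCreator = db.getView("ByCreator");   // assumed view, first sorted column = CreatedBy

            // In the real agent this list would come from the multi-value field.
            List<String> users = Arrays.asList("Thomas", "Peter");

            List<Document> results = new ArrayList<Document>();
            for (String user : users) {
                // One lookup per user, exact-match on the sorted column.
                DocumentCollection dc = byCreator.getAllDocumentsByKey(user, true);
                Document doc = dc.getFirstDocument();
                while (doc != null) {
                    results.add(doc);
                    doc = dc.getNextDocument(doc);
                }
            }

            // Combine and sort by creation date in memory.
            Collections.sort(results, new Comparator<Document>() {
                public int compare(Document a, Document b) {
                    try {
                        return a.getCreated().toJavaDate().compareTo(b.getCreated().toJavaDate());
                    } catch (NotesException e) {
                        return 0;
                    }
                }
            });
            // "results" now holds the combined, date-sorted collection to page through.
        } catch (NotesException e) {
            e.printStackTrace();
        }
    }
}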

Easily done with Ytria's ScanEZ. They have trials, I think

Related

Can you find a specific document's position in a sorted Azure Search index?

We have several Azure Search indexes that use a Cosmos DB collection of 25K documents as a source and each index has a large number of document properties that can be used for sorting and filtering.
We have a requirement to allow users to sort and filter the documents, and then search and jump to a specific document's page in the paginated result set.
Is it possible to query an Azure Search index with sorting and filtering and get the position/rank of a specific document id from the result set? Would I need to look at an alternative option? I believe there could be a way of doing this with a SQL back-end but obviously that would be a major undertaking to implement.
I've yet to find a way of doing this other than writing a query that paginates through the results until I find the required document, which would be relatively expensive and possibly slow in terms of processing on the server.
There is no mechanism in Azure Search for filtering within the result set of another query. You'd have to page through results, looking for the document ID on the client side. If your queries aren't very selective and produce many pages of results, this can be slow, as $skip actually re-evaluates all results up to the page you specify.
You could use caching to make this faster. At least one Azure Search customer is using Redis to cache search results. If your queries are selective enough, you could even cache the results in memory so you'd only pay the cost of paging once.
Trying this at the moment. I'm using a two-step process:
Generate your query but set $count=true and $top=0. The query result should contain a field named @odata.count.
You can then pick an index and use $top=1 and $skip=<index> to return a single entry. There is one caveat: $skip will only accept numbers less than 100000.
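A hedged sketch of that two-step process against the Azure Search REST API, using the Java 11 HttpClient. The service and index names, the query key, the api-version and the sortable "price" field are all placeholders, and any query values containing spaces would need URL-encoding:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AzureSearchPosition {
    // Placeholder service, index, api-version and key.
    static final String BASE = "https://myservice.search.windows.net/indexes/myindex/docs";
    static final String COMMON = "?api-version=2020-06-30&search=*&$orderby=price";

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Step 1: same search/order as the real query, but $top=0 and $count=true.
        // The JSON response reports the total number of hits in "@odata.count".
        HttpRequest countReq = HttpRequest.newBuilder()
                .uri(URI.create(BASE + COMMON + "&$count=true&$top=0"))
                .header("api-key", "<query-key>")
                .build();
        System.out.println(http.send(countReq, HttpResponse.BodyHandlers.ofString()).body());

        // Step 2: fetch the single entry at a chosen position in the sorted result set.
        int position = 42;   // caveat from above: must stay below 100000
        HttpRequest pageReq = HttpRequest.newBuilder()
                .uri(URI.create(BASE + COMMON + "&$top=1&$skip=" + position))
                .header("api-key", "<query-key>")
                .build();
        System.out.println(http.send(pageReq, HttpResponse.BodyHandlers.ofString()).body());
    }
}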

CouchDB - human readable id

I'm using CouchDB with node.js. Right now there is one node involved, and even in the remote future it isn't planned to change that. While I can remove most of the cases where a short, auto-increment-like ID (it can be sparse, but not random) is required, there remains one place where the user actually needs to enter the ID of a product. I'd like to keep this ID as short as possible and in a more human-readable format than something like '4ab234acde242349b', as it sometimes has to be typed by hand, and so on.
However, in the database it can be stored with whatever ID pleases CouchDB (the default auto-generated UUID), but it should be possible to identify it by a number as well. What I have thought about is creating a document that consists of an array with all the UUIDs from CouchDB. When I create a new product in node, I would run an update handler that appends the new unique ID to the end of that document. To obtain the product's ID I'd then query the array, and client-side, using indexOf, I could get the index as a short ID.
I don't know if this is feasible. From a performance point of view I can say the following: there are more queries that do numerical ID -> UUID than UUID -> numerical ID. There will be at most 7000 new entries a year in the database. Also, there is no use case where a product can be deleted yet, but I'd prefer not to rely on that.
Are there any other applicable ways to generate a shorter and more human-readable ID that can be associated with my document?
/EDIT
From a technical point of view: it seems to be working. I can do both conversions, number <-> UUID, and it seems to go well. I don't know how well this works with replication and such, but since it's just the aforementioned array, I guess it should, right?
You have two choices here:
Set your human-readable ID as the _id field. You can simply set it in the document-create calls to the DB, and it will be accepted. This can be a more lightweight solution, but it comes with some limitations:
It has to be unique. You should also be careful about clients that try to create documents but instead overwrite existing ones.
It can only contain alphanumeric characters and a few special characters. In my experience it is asking for trouble to use other character types.
It cannot be longer than a theoretical string length limit (CouchDB doesn't define one, but you should). Long IDs will increase the size of your views (indexes) badly, and might make them slower.
If these things are no problem for you, then you should go with this solution.
As you said yourself, let the _id be a UUID, and put the human-readable ID in another field. To reach a document by the human-readable ID, you can just create a view emitting the human-readable ID as the key, and then either emit the document as the value or fetch it via the include_docs=true option. Whenever the view is queried, CouchDB will update it incrementally and return you the list. This is really the same as you creating a document with an array/object of IDs inside it, except that by using a CouchDB view you get better performance.
This might be slightly slower on querying and inserting. If the IDs are inserted sequentially it's fine; if not, CouchDB will take slightly more time to insert them at the right place. This doesn't work well with huge amounts of inserts hitting the DB.
Querying shouldn't take more than 10% longer than the first option, and I think 10% is a high estimate; it will most probably be less than 5%. I remember in my CouchDB application I switched from reading by _id to reading from a view by a key, and the slowdown was so small that, from the user's point of view, even when making 100 queries at the same time, it wasn't noticeable.
This is how people query documents by fields other than the ID, for example querying a user document by email when the user is logging in.
If you don't know how CouchDB views work, you should read the views chapter of the CouchDB definitive guide.
Also make sure you stay away from documents with huge arrays inside them. I think CouchDB has a limit of 4GB per document. I remember having such documents, and querying took really long because the view had to iterate over each array item. In the end I created one document per array item instead, and it was way faster.
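A minimal sketch of the view-based option (the second choice above) over CouchDB's HTTP API, using the Java 11 HttpClient. The database name "products", the "shortId" field and the design/view names are made-up placeholders, and authentication is omitted:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchShortIdLookup {
    static final String DB = "http://localhost:5984/products";   // placeholder database

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // One-time setup: a design document whose map function emits the
        // human-readable ID ("shortId" here) as the view key.
        // Re-running this without a _rev will return a 409 conflict.
        String designDoc = "{\"views\":{\"by_short_id\":{"
                + "\"map\":\"function(doc){ if (doc.shortId) emit(doc.shortId, null); }\"}}}";
        HttpRequest createView = HttpRequest.newBuilder()
                .uri(URI.create(DB + "/_design/lookup"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(designDoc))
                .build();
        System.out.println(http.send(createView, HttpResponse.BodyHandlers.ofString()).body());

        // Lookup: human-readable ID -> full document, via include_docs=true.
        // %22 is a URL-encoded double quote around the JSON key value.
        HttpRequest lookup = HttpRequest.newBuilder()
                .uri(URI.create(DB + "/_design/lookup/_view/by_short_id?key=%22421%22&include_docs=true"))
                .build();
        System.out.println(http.send(lookup, HttpResponse.BodyHandlers.ofString()).body());
    }
}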

Range-based, chronological pagination queries across multiple collections with MongoDB?

Is there an efficient way to do a range-based query across multiple collections, sorted by an index on timestamps? I basically need to pull in the latest 30 documents from 3 collections and the obvious way would be to query each of the collections for the latest 30 docs and then filter and merge the result. However that's somewhat inefficient.
Even if I were to select only the timestamp field in the first query and then do a second batch of queries for the latest 30 docs, I'm not sure that would be a better approach. That would be 90 documents (whole or single-field) per pagination request.
Essentially the client can be subscribed to articles and each category of article differs by 0 - 2 fields. I just picked 3 since that is the average number of articles that users are subscribed to so far in the beta. Because of the possible field differences, I didn't think it would be very consistent to put all of the articles of different types in a single collection.
MongoDB operations operate on one and only one collection at a time. Thus you need to structure your schema with collections that match your query needs.
Option A: Get Ids from supporting collection, load full docs, sort in memory
One option is a supporting collection that combines the IDs, main collection names, and timestamps of the 3 collections into a single collection. You query that to get your 30 ID/collection pairs, then load the corresponding full documents with 3 additional queries (one to each main collection). Of course those won't come back in the correct combined order, so you need to sort that page of results in memory before returning it to your client. A document in the supporting collection could look like this:
{
  _id: ObjectId,
  updated: Date,
  type: String
}
This way allows mongo to do the pagination for you.
Option B: 3 Queries, Union, Sort, Limit
Or, as you said, load 30 documents from each collection, sort the union set in memory, drop the extra 60, and return the combined result. This avoids the extra collection overhead and synchronization maintenance.
So I would think your current approach (Option B as I call it) is the lesser of those 2 not-so-great options.
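For illustration, a rough sketch of Option B with the MongoDB Java driver; the collection names ("news", "reviews", "blogs") and the "updated" timestamp field are invented for the example:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MergedFeed {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase db = client.getDatabase("articles");

        List<Document> union = new ArrayList<>();
        // Latest 30 from each collection (3 queries)...
        for (String name : new String[] {"news", "reviews", "blogs"}) {
            db.getCollection(name)
              .find()
              .sort(Sorts.descending("updated"))
              .limit(30)
              .into(union);
        }

        // ...then sort the union in memory and drop the extra 60.
        union.sort(Comparator.comparing(
                (Document d) -> d.getDate("updated")).reversed());
        List<Document> page = union.subList(0, Math.min(30, union.size()));
        page.forEach(d -> System.out.println(d.toJson()));

        client.close();
    }
}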
If your query is really to get the most recent articles based on a selection of categories, then I'd suggest you:
A) Store all of the documents in a single collection so they can be fetched with a single query for a combined paged result. Unless you have a very consistent date range across collections, you'll need to track date ranges and counts so that you can reasonably fetch a set of documents that can be merged; 30 from one collection may be older than all from another. You can add an index on timestamp and category and then limit the results.
B) Cache everything aggressively so that you rarely need to do the merges
You could use the same idea I explained here; although that post is about MongoDB text search, it applies to any kind of query:
MongoDB Index optimization when using text-search in the aggregation framework
The idea is to query all your collections ordering them by date and id, then sort/mix the results in order to return the first page. Subsequent pages are retrieved by using last document's date and id from the previous page.
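To make that concrete, here is a hedged sketch of such keyset-style pagination with the MongoDB Java driver for a single collection (you would run it per collection and merge as described above); the field names and the boundary values are placeholders:

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import org.bson.conversions.Bson;
import org.bson.types.ObjectId;
import java.util.Date;

public class KeysetPage {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("articles").getCollection("news");

        // Boundary taken from the last document of the previous page.
        Date lastDate = new Date();
        ObjectId lastId = new ObjectId();

        // Everything strictly "after" the boundary, in (date, _id) order.
        Bson afterBoundary = Filters.or(
                Filters.lt("updated", lastDate),
                Filters.and(Filters.eq("updated", lastDate), Filters.lt("_id", lastId)));

        for (Document d : coll.find(afterBoundary)
                .sort(Sorts.orderBy(Sorts.descending("updated"), Sorts.descending("_id")))
                .limit(30)) {
            System.out.println(d.toJson());
        }
    }
}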

SOLR/Lucene weighting by user-centric criteria

We are switching from SQL full-text search to Lucene (SOLR stack) search in the next few months. One last wrinkle in figuring out our strategy has to do with replicating one part of our current search platform.
First, some nomenclature to describe the problem: Our site has a bunch of documents. People might "add" those documents, they might "favorite" those documents, they might "read" those documents, etc. Let's call that union of such documents for a given user their "personal documents". Some documents are public, and some are private so that only the logged-in-user can see them.
Currently, we have a weighting function that will always show a given user's "personal" documents FIRST in the search list, for any search. This outranks the normal order (but a document must be valid in the result set -- it just ranks above any other less important document). In SQL, we are able to achieve this by having a user-defined-function that returns a score, and it varies by user.
An analogy is Facebook -- where, when you type "Joe", it will first find all the Joes that you know, followed by any other Joe that meets the criteria. My search for "Joe" will return a different ordered set than your search for Joe.
In the world of Lucene/SOLR, as I understand it, I cannot figure out how to have such user-centric weighting of documents without two separate queries that are then effectively UNIONed together (I know, it's not relational, but you get the idea). We have millions of users, and hundreds of thousands of documents. If a user is logged in, we want "their documents" to show up first in any search, then the rest of all documents. And in each case, we want the search results to show only those documents that match the original search -- we're just talking about rank-order.
Can you think of any strategies here to reproduce this user-defined-function feature?
Can you afford to have a field in each document telling that this particular document belongs to Jim (e.g. user123Doc:1)? If yes, you could solve it by sorting the result set by {user123Doc, score, ...}.
Or, if you don't want to store this information in Lucene, you can store this elsewhere (e.g. in the database) and implement FieldComparator so it works with these values. More on this is available here.
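As an illustration of the "sort by {user123Doc, score}" idea, a small SolrJ sketch; the core URL and the per-user flag field user123Doc (a boolean that is true on the logged-in user's "personal" documents) are assumptions based on the example above and would need to exist in your schema:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PersonalFirstSearch {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build();

        SolrQuery q = new SolrQuery("joe");
        // Primary sort: the per-user flag, so the logged-in user's documents rank first;
        // secondary sort: normal relevance score. The flag field must be present
        // (or default to false) on every document for the sort to behave.
        q.setSort("user123Doc", SolrQuery.ORDER.desc);
        q.addSort("score", SolrQuery.ORDER.desc);
        q.setRows(20);

        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}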

Why should (or shouldn't) a search query return only document IDs?

So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls for storing only searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id, etc. The search queries will return only the IDs of the matching items and a class (supplier_id) that would be used to identify which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only IDs? Or, more specifically, why should a search application return only the IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerful index, and since an index gives IDs back, it is logical that Solr does the same.
You can use the solr query parameter fl to ask for identifier only results, for instance fl=id.
However, there is one feature that needs Solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using Solr to retrieve only the identifiers is fine (I assume you need only the document list, and no other features like facets, related docs or spell checking).
That said, what matters is how you build your objects in your search function: either from the DB (using Solr only to retrieve IDs), from Solr-returned fields (provided they're stored), or even a mix of both - think Solr for the 'highlighted' content fields and the DB for the other ones. Again, if you don't need highlighting, this is not an issue.
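To illustrate the IDs-only pattern (fl=id, then fetch from the DB) end to end, a hedged sketch with SolrJ plus plain JDBC; the core URL, JDBC URL, table and column names are placeholders, a JDBC driver is assumed to be on the classpath, and a real implementation would use a prepared statement:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class IdOnlySearch {
    public static void main(String[] args) throws Exception {
        // 1) Ask Solr for matching IDs only (equivalent to fl=id).
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/staging").build();
        SolrQuery q = new SolrQuery("name:widget");
        q.setFields("id");
        List<String> ids = new ArrayList<>();
        for (SolrDocument doc : solr.query(q).getResults()) {
            ids.add(doc.getFieldValue("id").toString());
        }
        solr.close();

        // 2) Fetch the full rows from the database by ID.
        if (!ids.isEmpty()) {
            String in = String.join(",", ids);   // assumes numeric IDs; use bind parameters otherwise
            try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/catalog", "user", "pass");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT * FROM staging_products WHERE id IN (" + in + ")")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}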
I'm using Solr with thousands of documents but only return the IDs, for the following reasons:
For Solr:
- if the index gets a bit out of sync, it's not a big deal (especially in your case: displaying a stale price from the index would be a real issue, whereas with IDs only the worst case is that an item isn't in quite the right place, but the data shown is right)
- you will save a lot of time when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time!)
- you build your results in the same way (you don't need one method to build HTML from Solr and another from your DB)
I think there is a lot more...
