RavenDB: How to query many different documents in a single query? - search

I am new to RavenDB and I'm not sure how to address this issue.
I have a document store with around 200 different document types. Each type can contain thousands of documents.
In my business logic all the different document types are treated the same - they can all be mapped to a generic object such as a DataTable.
I would like to query all the properties of all the documents from all types in a single free text search. What is the best way to do that?

You can do this using a multi-map index. Take a look at this post:
http://ayende.com/blog/156225/relational-searching-sucks-donrsquo-t-try-to-replicate-it
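For illustration, here is a minimal sketch of what such a multi-map index might look like with the RavenDB .NET client. The Order and Invoice types, their fields, and the index name are placeholders for your own document types, and FieldIndexing.Analyzed is the pre-4.0 name (newer clients call it FieldIndexing.Search):

// A multi-map index that projects text from several document types into one searchable field.
public class Everything_Search : AbstractMultiMapIndexCreationTask<Everything_Search.Result>
{
    public class Result
    {
        public string Content { get; set; }
    }

    public Everything_Search()
    {
        // One AddMap per document type; Order and Invoice are stand-ins for your own types.
        AddMap<Order>(orders => from o in orders
                                select new { Content = new object[] { o.CustomerName, o.Notes } });
        AddMap<Invoice>(invoices => from i in invoices
                                    select new { Content = new object[] { i.Title, i.Body } });

        // Analyze the combined field so it supports free-text search.
        Index(x => x.Content, FieldIndexing.Analyzed);
    }
}

// One free-text query across every mapped type:
var results = session.Query<Everything_Search.Result, Everything_Search>()
                     .Search(x => x.Content, "the terms to look for")
                     .As<object>()
                     .ToList();

With 200 document types you would presumably generate the AddMap calls in a loop, or use AddMapForAll<T> if they share a common base class.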

Related

homogeneous vs heterogeneous in documentdb

I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should have all three of those things stored in a single collection with a field describing what the data type is, that each collection should instead be split by date or geographic area so one part of the world has a smaller portion to search,
and to:
"Combine different types of documents into a single collection and add
a field across all to separate them in searching like a type field or
something"
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (for example, the Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. In order to leverage CosmosDb to its full potential, you need to think of a collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you just select a generic value such as key or partitionKey you can easily separate the storage of your inbound emails, from users, from anything else by picking appropriate values.
class InboundEmail
{
    public string Key { get; set; } = "EmailsPartition";
    // other properties
}

class User
{
    public string Key { get; set; } = "UsersPartition";
    // other properties
}
What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
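To make the cost difference concrete, here is a rough sketch with the DocumentDB .NET SDK; the database, collection, the UserActivity type and its properties are made up for the example, and client is assumed to be an existing DocumentClient:

// Query confined to one known partition: fast, touches a single partition's index.
var userActivity = client.CreateDocumentQuery<UserActivity>(
        UriFactory.CreateDocumentCollectionUri("mydb", "mycollection"),
        new FeedOptions { PartitionKey = new PartitionKey("user-1234") })
    .Where(a => a.Type == "PageView")
    .ToList();

// Fan-out query: must be explicitly enabled and scans every partition (slower, more RUs).
var allPageViews = client.CreateDocumentQuery<UserActivity>(
        UriFactory.CreateDocumentCollectionUri("mydb", "mycollection"),
        new FeedOptions { EnableCrossPartitionQuery = true })
    .Where(a => a.Type == "PageView")
    .ToList();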
If you want evidence that this is the appropriate way to use CosmosDb, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you put your entities in different collections, none of the Graph API queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." CosmosDb does automatically index every field of every document. It uses a special proprietary path-based indexing mechanism that ensures every path of your JSON tree has indices on it. You have to specifically opt out of this auto-indexing feature.
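If you ever do want to opt out for specific paths, the indexing policy can be shaped when the collection is created. A sketch with the DocumentDB .NET SDK follows; the names and the excluded path are just examples:

var collection = new DocumentCollection { Id = "mycollection" };
collection.PartitionKey.Paths.Add("/key");

// Everything is indexed by default; exclude paths you will never query on.
collection.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/*" });
collection.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/rawEmailBody/*" });

await client.CreateDocumentCollectionIfNotExistsAsync(
    UriFactory.CreateDatabaseUri("mydb"),
    collection,
    new RequestOptions { OfferThroughput = 400 });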

DocumentDB data structure misunderstanding

I'm starting a new website project and I would like to use DocumentDB as the database instead of a traditional RDBMS.
I will need to store two kinds of documents:
User documents, which will hold all the user data.
Survey documents, which will hold all the data about surveys.
May I put both kinds in a single collection, or should I create one collection for each?
How you do this is totally up to you - it's a fairly broad question, and there are good reasons for combining, and good reasons for separating. But objectively, you'll have some specific things to consider:
Each collection has its own cost footprint (starting around $24 per collection).
Each collection has its own performance (RU capacity) and storage limit.
Documents within a collection do not have to be homogeneous - each document can have whatever properties you want. You'll likely want some type of identification property that you can query on, to differentiate document types, should you store them all in a single collection.
Transactions are collection-scoped. So, for example, if you're building server-side stored procedures and need to modify content across your User and Survey documents, you need to keep this in mind.

DocumentDB: get all documents of same entity type

I'm storing documents of several different types (entity types?) in a single collection. What would be the best way to get all documents of a certain type (like you would do with SELECT * FROM a table)?
Options I see so far:
Include the type as a property. But that would mean looking into every document when getting the documents, right?
Prepend the type name to the document id and try searching by id with typename*.
Is there a better way to do this?
There's no built-in entity-type property, but you can certainly create your own, and ensure that it's indexed. At this point, it's as straightforward as adding a WHERE clause:
WHERE docs.docType = "SomeType"
Assuming it's a hash-based index, this should provide efficient lookups and filter out unwanted document types.
While you can embed the type into a property (such as document id), you'd then have to do partial string matches, which won't be as efficient as an indexed-property comparison.
If you're curious to know what this query is costing you, the RU value is displayed both in the portal and via x-ms-request-charge return header.
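As a sketch of that pattern from the .NET SDK (the Customer type, its property names, and the database/collection names are just examples):

// Each document carries its own discriminator.
class Customer
{
    [JsonProperty("id")]
    public string Id { get; set; }

    // Stored as "docType" in the JSON document and indexed automatically.
    [JsonProperty("docType")]
    public string DocType { get; set; } = "Customer";

    // other properties
}

// Pull back only documents of that type.
var customers = client.CreateDocumentQuery<Customer>(
        UriFactory.CreateDocumentCollectionUri("mydb", "mycollection"),
        "SELECT * FROM docs WHERE docs.docType = 'Customer'")
    .ToList();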
I agree with David's answer, and using a single docType field is what I did when I first started using DocumentDB. However, there is another option that I started using after doing some experiments: create an is<Type> field and set its value to true. This is slightly more efficient for queries than using a single string field, because the indexes themselves are smaller partial indexes, but it could potentially take up slightly more storage space.
The other advantage to this approach is that it provides advantages for inheritance and mixins. For example, I have both isLookup=true and isState=true on certain entities. I also have other lookup types. Then in my application code, some behaviors are common for all lookup fields and other behaviors are only applicable to the State type.
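A short sketch of that approach (names are illustrative): each entity sets whichever flags apply to it, and queries filter on the flag rather than on a string value.

// A State entity participates in both the State and the more general Lookup behavior.
class State
{
    [JsonProperty("isLookup")]
    public bool IsLookup { get; set; } = true;

    [JsonProperty("isState")]
    public bool IsState { get; set; } = true;

    // other properties
}

// All lookup-style documents, regardless of their concrete type:
//   SELECT * FROM docs WHERE docs.isLookup = true
// Only states:
//   SELECT * FROM docs WHERE docs.isState = true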
If you index the type property on the collection, it will not be a complete scan.

Data relationships as a context for search in Marklogic

I am using MarkLogic's search functionality to create a search page. As of right now, I'm running an XQuery to get search results through search:search. As a bare-bones example, see this code:
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
at "/MarkLogic/appservices/search/search.xqy";
search:search('test',
<options xmlns='http://marklogic.com/appservices/search'></options>)
This search searches all content in the database, which is fine in many cases. In other cases, I search based on collections with cts:collection-query. The collections serve as great contexts for my searches.
Now, I would like to limit my search results based on a relationship of data in a "main" document. This "main" document has all the relationships in an object model. If that object model has a reference to a document, I want that document included in the search. Essentially, the "main"/model document is the context of the search.
I was trying to brainstorm some ideas of the best way to do this. Here's what I've come up with thus far, but I was hoping someone more familiar with MarkLogic (I've only been working with it for 6 months) could lead me in a good direction:
1. Add all documents referenced in the model document to a unique collection, then search based on that collection. However, the collections would have to be updated as the model changed.
2. Load the model document into my code, get a list of all the references, and add them to the query with cts:document-query (or the like).
3. Restructure my concept of a "model" somehow in my XML documents.
Thanks for any input or suggestions.
I would start with (2) and see if the performance is good enough. That will depend on your use-case, but I expect it should be fine for thousands or even hundreds of thousands of references.
Be sure to use a single-term cts:document-query($list-of-references). That will be faster than cts:or-query(for $ref in $list-of-references return cts:document-query($ref)), because the index lookup can be a single pass instead of N separate lookups.
All of these ideas would work fine. Deciding which to use depends on particulars of your application, such as how often the main document changes (and whether you are in control of it) and how hard it is to remodel your XML.
Another thing to consider is that you can set a trigger on document updates which could perform the collection changes automatically.
-David Lee

Why should (or shouldn't) a Search Query return back only document IDs?

So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and, instead of inserting them directly into our catalog, store all the information in a staging area. Each supplier has their own stage (i.e. a table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table, but later on perhaps Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description), be shown a list of matching products, and choose to have those products pushed into the live catalog. The search will query the single table (the flattened staging areas).
My design calls for storing only searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id, etc. The search queries will return only the IDs of the matching items and a class (supplier_id) that identifies which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only IDs? Or more specifically, why should a search application return only the IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
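To make that two-step flow concrete, here is a rough sketch of the boundary being described. ISearchIndex, StagedProduct, the table-naming scheme, and the use of Dapper for the DB query are all assumptions; the implementation behind the interface could be the flattened table today and Sphinx or Solr later, without the calling code changing:

// The search layer only knows how to turn free text into (supplier_id, product_id) pairs.
public interface ISearchIndex
{
    IReadOnlyList<(int SupplierId, int ProductId)> Search(string freeText);
}

public class StagedProductFinder
{
    private readonly ISearchIndex _index;
    private readonly IDbConnection _db;

    public StagedProductFinder(ISearchIndex index, IDbConnection db)
    {
        _index = index;
        _db = db;
    }

    // Hydrate full rows from each supplier's staging table using only the IDs the search returned.
    public IEnumerable<StagedProduct> Find(string freeText)
    {
        foreach (var hits in _index.Search(freeText).GroupBy(h => h.SupplierId))
        {
            // e.g. SELECT * FROM supplier_42_staging WHERE id IN (1, 2, 3)
            var sql = $"SELECT * FROM supplier_{hits.Key}_staging WHERE id IN @Ids";
            // Dapper's Query<T> (used here for brevity) expands the Ids list into the IN clause.
            foreach (var row in _db.Query<StagedProduct>(sql, new { Ids = hits.Select(h => h.ProductId) }))
                yield return row;
        }
    }
}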
You can regard Solr as a powerful index, and since an index gives IDs back, it is logical that Solr does the same.
You can use the Solr query parameter fl to ask for identifier-only results, for instance fl=id.
However, there's one feature that needs Solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using Solr to retrieve only the identifiers is fine (I assume you need only the document list, and no other features like facets, related docs or spell checking).
That said, consider how you build your objects in your search function: either from the DB (using Solr only to retrieve IDs), from the fields Solr returns (provided they're stored), or even a mix of both. Think Solr for the 'highlighted' content fields and the DB for the other ones. Again, if you don't need highlighting, this is not an issue.
I'm using Solr with thousands of documents but only return the IDs, for the following reasons:
For Solr:
- if some sync mistake happens, it's not a big deal (especially in your case, where displaying a wrong price would be a big issue; at worst the item shows up where it shouldn't, but the data are right)
- you will save a lot of time when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time!)
- you build your results in the same way (you don't need one method to build HTML from Solr and another method from your DB)
I think there is a lot more...
