DocumentDB: get all documents of same entity type - azure

I'm storing documents of several different types (entity types?) in a single collection. What would be the best way to get all documents of a certain type (like you would do with select * from a table)?
Options I see so far:
Include the type as a property. But that would mean looking into every document when getting the documents, right?
Prepend the type name to the document id and try searching by id with typename*.
Is there a better way to do this?

There's no built-in entity-type property, but you can certainly create your own, and ensure that it's indexed. At this point, it's as straightforward as adding a WHERE clause:
WHERE docs.docType = "SomeType"
Assuming it's a hash-based index, this should provide efficient lookups and filter out unwanted document types.
While you can embed the type into a property (such as document id), you'd then have to do partial string matches, which won't be as efficient as an indexed-property comparison.
If you're curious to know what this query is costing you, the RU value is displayed both in the portal and via the x-ms-request-charge response header.
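A minimal sketch of the type-property approach, written in Python. The query spec (a query string plus a parameters list) follows the DocumentDB SQL parameterization format. The helper names type_query and filter_by_type are hypothetical, and the in-memory filter only imitates what the service does on the server side.

```python
def type_query(doc_type):
    """Build a parameterized query spec that selects one document type."""
    return {
        "query": "SELECT * FROM docs WHERE docs.docType = @docType",
        "parameters": [{"name": "@docType", "value": doc_type}],
    }


def filter_by_type(docs, doc_type):
    """In-memory stand-in for the server-side filter (illustration only)."""
    return [d for d in docs if d.get("docType") == doc_type]


# Hypothetical sample documents sharing one collection.
docs = [
    {"id": "1", "docType": "User", "name": "Ann"},
    {"id": "2", "docType": "Firm", "name": "Acme"},
    {"id": "3", "docType": "User", "name": "Bob"},
]

spec = type_query("User")
users = filter_by_type(docs, "User")
print(spec["query"])
print([d["id"] for d in users])
```

Passing the type as a parameter rather than concatenating it into the query text keeps the query string constant across types.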

I agree with David's answer, and using a single docType field is what I did when I first started using DocumentDB. However, there is another option that I started using after doing some experiments: create an is<Type> field and set its value to true. This is slightly more efficient for queries than using a single string field, because the indexes themselves are smaller partial indexes, but it could potentially take up slightly more storage space.
The other advantage of this approach is that it supports inheritance and mixins. For example, I have both isLookup=true and isState=true on certain entities. I also have other lookup types. Then in my application code, some behaviors are common to all lookup fields and other behaviors apply only to the State type.
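To illustrate the is<Type> flags described above, here is a small Python sketch. The documents and the isCurrency and isUser flags are hypothetical; only isLookup and isState come from the answer. The in-memory filter stands in for a query such as WHERE docs.isLookup = true.

```python
# Hypothetical documents: one entity can carry several type flags at once.
docs = [
    {"id": "CA", "isLookup": True, "isState": True, "name": "California"},
    {"id": "USD", "isLookup": True, "isCurrency": True, "name": "US Dollar"},
    {"id": "u1", "isUser": True, "name": "Ann"},
]


def with_flag(docs, flag):
    """Return the documents whose boolean type flag is set to True."""
    return [d for d in docs if d.get(flag) is True]


# Behavior shared by all lookups applies to both "CA" and "USD" ...
lookups = with_flag(docs, "isLookup")
# ... while state-specific behavior applies only to "CA".
states = with_flag(docs, "isState")

print([d["id"] for d in lookups])
print([d["id"] for d in states])
```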

If you index the type property on the collection, it will not be a complete scan.

Related

jOOQ difference between Record and TableRecord

I would like to know what the difference is between a jOOQ Record and a TableRecord. So for example a User and a UserRecord. I can see that it has something to do with the actual nullability of a certain table, but why does everyone use the TableRecord and when should I ever use the normal Record?
Thanks!
There's a manual page about literally your question: Record vs. TableRecord. In short:
Record is the generic super type of all jOOQ records.
TableRecord is a specific type of record, which can be associated with a table in your schema. This type is typically extended by code generation output.
So for example a User and a UserRecord
This might be a different question. jOOQ's code generator produces these artifacts for each table, depending on your configuration:
The Table (e.g. User). You use this to construct type safe jOOQ queries
The TableRecord (e.g. UserRecord). You can use this to simplify some CRUD operations
The POJO (e.g. User, but in a different package). You can use this to map results to simple POJOs

homogeneous vs heterogeneous in documentdb

I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong. I should have all three of those things stored in a single collection with a field to describe what the data type is. That each collection should be related by date or geographic area so one part of the world has a smaller portion to search.
and to:
"Combine different types of documents into a single collection and add
a field across all to separate them in searching like a type field or
something"
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (example: Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. In order to leverage CosmosDb to its full potential, you need to think of a Collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you just select a generic field name such as key or partitionKey, you can easily separate the storage of your inbound emails from users and from anything else by picking appropriate values.
class InboundEmail
{
    public string Key { get; set; } = "EmailsPartition";
    // other properties
}
class User
{
    public string Key { get; set; } = "UsersPartition";
    // other properties
}
What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
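The difference between a single-partition query and a cross-partition scan can be sketched with a toy in-memory model in Python. This is not the Cosmos SDK: the PartitionedCollection class, its methods, and the sample key values are all hypothetical, and the model only counts how many partitions a query has to touch.

```python
from collections import defaultdict


class PartitionedCollection:
    """Toy model: documents are grouped by the value of one key field."""

    def __init__(self, partition_key_field):
        self.field = partition_key_field
        self.partitions = defaultdict(list)

    def insert(self, doc):
        self.partitions[doc[self.field]].append(doc)

    def query(self, predicate, partition_key=None):
        """Return (matches, number_of_partitions_touched)."""
        if partition_key is not None:
            # Known partition: only that partition is read.
            targets = [partition_key] if partition_key in self.partitions else []
        else:
            # Unknown partition: every partition has to be scanned.
            targets = list(self.partitions)
        matches = [d for k in targets for d in self.partitions[k] if predicate(d)]
        return matches, len(targets)


coll = PartitionedCollection("Key")
coll.insert({"Key": "user-1", "type": "Activity", "action": "login"})
coll.insert({"Key": "user-1", "type": "Activity", "action": "upload"})
coll.insert({"Key": "user-2", "type": "Activity", "action": "login"})
coll.insert({"Key": "EmailsPartition", "type": "InboundEmail", "subject": "Hi"})


def is_activity(doc):
    return doc["type"] == "Activity"


one_user, touched_single = coll.query(is_activity, partition_key="user-1")
everyone, touched_all = coll.query(is_activity)
print(len(one_user), touched_single)
print(len(everyone), touched_all)
```

In this model the per-user key keeps one user's activity in a single partition, so the first query reads one partition while the second has to visit all three.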
If you want evidence that this is the appropriate way to use CosmosDb, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities all in different collections, none of the Graph API or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." CosmosDb does automatically index every field of every document. It uses a proprietary path-based indexing mechanism that ensures every path of your JSON tree is indexed. You have to specifically opt out of this auto-indexing feature.
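To make "every path of your JSON tree" concrete, here is a rough Python sketch that enumerates the leaf paths of a document. It only illustrates the idea of path-based indexing and makes no claim about how CosmosDb implements its index. The function name json_paths, the path notation, and the sample document are assumptions.

```python
def json_paths(node, prefix=""):
    """Collect the path to every leaf value in a JSON-like structure."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            paths.extend(json_paths(value, f"{prefix}/{key}"))
    elif isinstance(node, list):
        for item in node:
            # All array elements share one path segment in this sketch.
            paths.extend(json_paths(item, f"{prefix}/[]"))
    else:
        paths.append(prefix)
    return paths


# Hypothetical document with a nested object and an array.
doc = {"name": "Acme", "address": {"city": "Oslo"}, "tags": ["a", "b"]}
unique_paths = sorted(set(json_paths(doc)))
print(unique_paths)
```

Two documents of different types with non-overlapping fields simply contribute different paths, which is why mixing types in one collection does not require designing a shared set of indexed fields up front.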

Does ArangoDB "know" what attributes exist in a collection? (shapes data)

There's a recipe how to sample documents and determine their structure:
https://docs.arangodb.com/cookbook/AccessingShapesData.html
It is stated that you can't query internal shapes data. But examining some documents will only approximate what attribute keys are used; otherwise the entire collection must be scanned.
So my question is: does the database store what attributes exist somewhere internally? At least for common attributes?
If yes, why isn't it possible to query that data? It would be far more efficient than a user-defined function that outputs roughly the same information.
It would be great if one could discover schemas "for free":
http://som-research.uoc.edu/tools/jsonDiscoverer/#/
Whenever an attribute is used first in a collection, ArangoDB will store this somewhere internally. That means it does keep track of which attributes were used in a collection. There are a few issues however:
the attribute names are stored globally, but nested attribute names are stored separately (ex: user.name will be stored as user and name). From looking at purely the separate attribute name parts, ArangoDB will not know in which combinations they are used in the data
attribute names are stored whenever an attribute name is first used in a collection. Currently ArangoDB does not keep track of when an attribute is not used anymore. The attribute name will then still be present in the list of attributes
Under these restrictions, the list of attributes could be made available, but I am not sure how useful this will be.
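The first restriction can be illustrated with a short Python sketch that collects attribute name parts the way the answer describes them, with nested names recorded separately. This is not ArangoDB's internal code. The function attribute_name_parts and the sample documents are hypothetical.

```python
def attribute_name_parts(docs):
    """Collect every attribute name part, ignoring where it was nested."""
    seen = set()

    def walk(value):
        if isinstance(value, dict):
            for key, child in value.items():
                seen.add(key)
                walk(child)
        elif isinstance(value, list):
            for item in value:
                walk(item)

    for doc in docs:
        walk(doc)
    return seen


# "name" appears nested under "user" in one document and at top level in another.
docs = [
    {"user": {"name": "Ann"}},
    {"name": "Acme", "city": "Oslo"},
]
parts = attribute_name_parts(docs)
print(sorted(parts))
```

The resulting set contains user, name, and city, but nothing in it says whether name was used as user.name, as a top-level attribute, or both.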

PouchDB structure

I am new to the NoSQL concept, so when I started to learn PouchDB, I found this conversion chart. My confusion is how PouchDB handles it if, let's say, I have multiple tables: does it mean that I need to create multiple databases? From my understanding, a database in PouchDB can store a lot of documents, but does a document mean a row in SQL, or have I misunderstood?
The answer to this question seems to be surprisingly under-documented. While @llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section When not to use map/reduce, Nolan explains that for simpler applications, the key is to abuse _ids, and leverage the power of allDocs().
In other words, if you had two separate types (say artists, and albums), then you could prefix the id of each type to obtain an easily searchable data set. For example _id: 'artist_name' & _id: 'album_title', would allow you to easily retrieve artists in name order.
Laying out the data this way will result in better performance due to not requiring extra indexes, and less code. Clearly however, if your data requirements are more complex, then views are the way to go.
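The prefixed-id layout works because ids are kept in sorted order, so one type occupies one contiguous key range. The Python sketch below imitates such a range read over a sorted list. It is not PouchDB code: the all_docs helper is hypothetical, its startkey and endkey parameters only mirror the allDocs() option names, and the "\uffff" suffix is an assumed high sentinel that sorts after any ordinary character.

```python
import bisect


def all_docs(sorted_ids, startkey, endkey):
    """Return the ids within [startkey, endkey] from an already sorted list."""
    lo = bisect.bisect_left(sorted_ids, startkey)
    hi = bisect.bisect_right(sorted_ids, endkey)
    return sorted_ids[lo:hi]


# Hypothetical ids using a type prefix followed by a name or title.
ids = sorted([
    "album_Bossanova",
    "artist_Pixies",
    "artist_Bjork",
    "album_Doolittle",
])

artists = all_docs(ids, "artist_", "artist_\uffff")
albums = all_docs(ids, "album_", "album_\uffff")
print(artists)
print(albums)
```

Each result comes back already ordered by name or title, with no secondary index involved.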
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. A SQL table defines column headers (name and type); these correspond to the JSON property names of the doc.
So, all docs (rows) with the same properties (a so-called "schema") are the equivalent of your SQL table. You can have as many different schemata in one database as you want (visit json-schema.org for some inspiration).
How to request them separately? Create CouchDB views! You can get all/some "rows" of your tabular data (docs with the same schema) with one request as you know it from SQL.
To make such views easy to write, a type property is very common in CouchDB docs. The name you know from a SQL table can be your type, like doc.type: "animal".
Your view names might then be animalByName or animalByWeight, depending on your needs.
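What a view such as animalByName does can be sketched in Python as a map function that emits a key for each matching doc, with the emitted rows then sorted by key. This only imitates the idea in memory. The build_view helper, the emit callback signature, and the sample docs are hypothetical, and real CouchDB views are written in JavaScript.

```python
def build_view(docs, map_fn):
    """Run map_fn over every doc and return the emitted rows sorted by key."""
    rows = []
    for doc in docs:
        map_fn(doc, lambda key, value: rows.append((key, value)))
    return sorted(rows, key=lambda row: row[0])


def animal_by_name(doc, emit):
    """Map function: index only docs of type "animal", keyed by name."""
    if doc.get("type") == "animal":
        emit(doc["name"], doc["_id"])


# Hypothetical docs of two different types in one database.
docs = [
    {"_id": "1", "type": "animal", "name": "zebra"},
    {"_id": "2", "type": "plant", "name": "fern"},
    {"_id": "3", "type": "animal", "name": "ant"},
]

view_rows = build_view(docs, animal_by_name)
print(view_rows)
```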
Sometimes a multiple-database plan is a good option, like a database per user or even a database per user-feature. Take a look at this conversation on the CouchDB mailing list.

RavenDB: How to query many different documents in a single query?

I am new to RavenDB and I'm not sure how to address this issue.
I have a document store with around 200 different document types. Each type can contain thousands of documents.
In my business logic all the different document types are treated the same - they can be all mapped to a generic object such as a DataTable.
I would like to query all the properties of all the documents from all types in a single free text search. What is the best way to do that?
You can do this using multi maps. Take a look at this post:
http://ayende.com/blog/156225/relational-searching-sucks-donrsquo-t-try-to-replicate-it
