homogeneous vs heterogeneous in documentdb - azure

I am using Azure DocumentDB, and all my experience in NoSQL has been in MongoDB. I looked at the pricing model and the cost is per collection. In MongoDB I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should store all three of those things in a single collection, with a field describing what type of data each document holds. Collections, they say, should instead be split by date or geographic area, so that each part of the world has a smaller portion to search.
I was also told to:
"Combine different types of documents into a single collection and add
a field across all to separate them in searching like a type field or
something"
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (for example, Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?

Yes. In order to leverage CosmosDb to its full potential, you need to think of a Collection as an entire Database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you select a generic field name such as key or partitionKey, you can easily separate the storage of your inbound emails from users, and from anything else, by picking appropriate values.
class InboundEmail
{
    public string Key { get; set; } = "EmailsPartition";
    // other properties
}
class User
{
    public string Key { get; set; } = "UsersPartition";
    // other properties
}
What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
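The partitioning idea above can be sketched in a few lines. This is only an illustration in Python, using an in-memory list rather than a real Cosmos DB client; the `type` field, the `partitionKey` values and the `query` helper are all hypothetical names chosen for the example. It shows why a query that names its partition touches far less data than one that has to fan out.

```python
from collections import defaultdict

# In-memory stand-in for one heterogeneous collection (not a real Cosmos DB client).
collection = [
    {"id": "u1", "type": "user", "partitionKey": "user-u1", "name": "Ann"},
    {"id": "e1", "type": "email", "partitionKey": "user-u1", "subject": "Hello"},
    {"id": "e2", "type": "email", "partitionKey": "user-u2", "subject": "Invoice"},
    {"id": "f1", "type": "firm", "partitionKey": "firm-f1", "name": "Acme"},
]

def build_partitions(docs):
    """Group documents by partition key, mimicking how a partitioned store splits data."""
    partitions = defaultdict(list)
    for doc in docs:
        partitions[doc["partitionKey"]].append(doc)
    return partitions

def query(partitions, doc_type, partition_key=None):
    """Return (matching documents, number of partitions scanned)."""
    if partition_key is not None:
        scanned = [partitions.get(partition_key, [])]  # single-partition query
    else:
        scanned = list(partitions.values())  # cross-partition fan-out
    hits = [d for part in scanned for d in part if d["type"] == doc_type]
    return hits, len(scanned)

partitions = build_partitions(collection)
single_hits, single_scanned = query(partitions, "email", partition_key="user-u1")
cross_hits, cross_scanned = query(partitions, "email")
```

Here the user and that user's emails share a partition key value, so reading one user's activity scans a single partition, while "all emails" has to visit every partition.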
If you want evidence that this is the appropriate way to use CosmosDb, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities in different collections, none of the Graph API calls or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." CosmosDb does automatically index every field of every document. It uses a proprietary path-based indexing mechanism that ensures every path of your JSON tree has indices on it. You have to specifically opt out of this auto-indexing feature.
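As a rough illustration of what opting out can look like, here is an indexing policy built as a Python dictionary and serialised to JSON. The property names follow the indexing policy format as I understand it from the Cosmos DB documentation, and the excluded `/body/*` path is a hypothetical example; check the current documentation before relying on any of it.

```python
import json

# Hypothetical policy: keep automatic indexing, but skip everything under "body".
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/body/*"}],
}

# JSON form, as it would be supplied when creating or updating a collection.
policy_json = json.dumps(indexing_policy, indent=2)
```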

Related

In cosmosdb, should I reference other documents using id, resource id, or self link?

I'm working on designing my CosmosDB collections and deciding what I will and won't nest in a single document, etc. There's no way around it, though - there will be scenarios where I need to reference documents from one collection within another.
I see that in CosmosDB there are several ways to identify a document - id, resource id and self link. It looks like id is enforced to be unique and can either be set by server or to whatever you want it to be. Next, it looks like resource id is always auto generated by the server and is guaranteed to be unique as well. Last, it looks like self link is built up using the id of the database, collection and document, meaning it'll also be unique. I see three different unique keys, all having their own uses and semantics.
Which one should I use internally when referencing other documents?
What about referencing documents in different collections - would resource id or self link be more "universal identifier" than just id?
DO use natural key for id values, if possible.
DO use id for cross-document references.
DO use names for collection/database references.
DO NOT use _rid or _selflink when you need a reliable long-term reference.
Why not use _rid/_selflink?
_rid - system-assigned identity in Cosmos DB inner storage. Its value is stable as long as the document does not move in storage, but it will change whenever the document is recreated in storage.
_selflink - system-assigned identity similar to _rid, but in addition to _rid it includes similar resource sub-keys for the Cosmos DB database and collection the document is in. So it is a reference to the document from the account level.
First, most likely _rid/_selflink have the potential to be slightly more performant as they are closer to actual data. Though in 99% of situations it should be negligible.
On the downside, _rid/_selflink will change when you move your documents for whatever reason. For example:
- backup and restore
- deleting a document and recreating it with exactly the same data
- renaming a Cosmos DB database/collection (currently achieved by creating a new one and moving the data)
- recreating a collection to get some new feature not applicable to existing collections
- refactoring the collection structure by moving document types (for business, performance or security reasons)
Should this happen, you would be in a world of pain to discover and fix all references from within your data documents. Ouch. That's fragile and cumbersome assuming you have lots of documents and non-trivial models.
Also, if you look at Microsoft API clients (e.g., the C# client), the comfortable path nowadays is to work with database/collection names and ids. Don't fight it. You'll just make your code uglier and your own life harder than intended.
Using them for temporary ad-hoc identities is ok though.
Why id?
id is a user-assigned key to a document, with a uniqueness guarantee within a partition.
It is optimized for retrieval, both in the API and performance-wise = faster to develop, better performance.
It can be set to a natural key - human-readable and business-wise meaningful without loading the referenced document. = fewer lookups, less confusion, fewer RU/s.
It is part of the user data and will never change when you move your documents around = predictable behavior, fewer bad surprises during disaster recovery.
The only caveat is that, as always with user-given identities, you have to plan a bit to be sure the identity range really is unique enough for your needs. Your app can always set stricter uniqueness properties (though they would not be enforced by Cosmos DB) or if you need ultimate uniqueness, then use Guids.
What about containers?
Same arguments apply to containers/databases.
The id is only unique within the document's partition. You could have as many documents with the same id as you like, as long as they have different partition key values.
The _rid is indeed unique and it's the best form of identification for a document. You can achieve the same by using the id and also providing the partition key value if your collection is partitioned.
There are two ways to read a document directly without querying for it:
- Using its self link, which looks like dbs/db_resourceid/colls/coll_resourceid/docs/doc_resourceid and uses the _rid values.
- Using its alternative link, which looks like dbs/db_id/colls/coll_id/docs/doc_id and uses the id values.
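To make the id-based option concrete, here is a small Python sketch of the two things an application would typically store or build: a reference made of id plus partition key value, and an id-based link. Both helper names are hypothetical, and the `docs` path segment is my understanding of the link format rather than something verified against the service.

```python
def document_reference(doc_id, partition_key):
    """A stable cross-document reference: the id plus the partition key value."""
    return {"id": doc_id, "partitionKey": partition_key}

def alt_link(database_id, collection_id, doc_id):
    """Build an id-based (alternative) link from user-assigned names."""
    return f"dbs/{database_id}/colls/{collection_id}/docs/{doc_id}"

ref = document_reference("metadata1", "sensor-42")
link = alt_link("telemetry", "Metadata", ref["id"])
```

Because both values come from user data, they survive a backup and restore or a collection rebuild unchanged, which is the argument made for id in the other answer.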
The safest form of document identification you can use is the one that uses the _rids.
In both of your questions, you should go with the self link.

Cosmos DB: How to reference a document in a separate collection using DocumentDB API

I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straight forward, as described in Modeling document data. However, I also would like to separate the related documents into different collections (this decision is related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: the reasoning for separate collections mainly comes down to partition keys and read/write priorities. Since a given partition key is required to be present in ALL documents in the collection, it makes sense to separate out documents in which the chosen partition key doesn't belong. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards, but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one-to-gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?
Metadata document in collection "Metadata":
{
    "id": "metadata1",
    ...
}
Measurement document in collection "Measurements":
{
    "id": "measurement1",
    "metadata-id": "../Metadata/metadata1",
    ...
}
Then, when I parse the data in my application/script I know what collection and document to query.
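For illustration, parsing such a path-style reference on the client could look like the Python sketch below. The `parse_reference` helper is hypothetical and only understands the `../Collection/id` shape shown above; the database does nothing with this string itself.

```python
def parse_reference(ref):
    """Split a '../Collection/documentId' reference into (collection, document id)."""
    parts = [p for p in ref.split("/") if p and p != ".."]
    if len(parts) != 2:
        raise ValueError(f"unexpected reference format: {ref!r}")
    collection, doc_id = parts
    return collection, doc_id

collection, doc_id = parse_reference("../Metadata/metadata1")
```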
Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores, not slashes; use a symbol to represent a collection, like $Metadata; etc). Or, is my use of relations spanning collections a code smell?
Thank you!
Edit: To the downvoter, can you please explain your reasoning? Is my question uninformed, unclear, or not useful? Why?
You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed at the per-collection level. What you should be doing is picking a more generic partition key, something like key or partitionKey. The tradeoff here is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use the value of whatever you chose originally for your Measurements documents and set something different for your Metadata documents.
I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous in documentdb. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos which necessitate having many different types of entities in a single collection and supports exactly the use case you're describing minus the extra collections. Obviously when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key which is why you need to go generic.
What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind though, that using this type of "relations" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so if you update those documents, you will have to update them in both collections, but you'll probably write less often than you read (probably, if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
    "id": "metadata1",
    ...
}
Measurement document in collection "Measurements":
{
    "id": "measurement1",
    "metadata": {
        "id": "metadata1",
        ...
    },
    ...
}
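A minimal Python sketch of that denormalisation step, assuming the metadata documents have already been loaded into a dictionary keyed by id. The `unit` and `value` fields and the `embed_metadata` helper are invented for the example.

```python
import copy

# Hypothetical metadata documents, keyed by id.
metadata_by_id = {"metadata1": {"id": "metadata1", "unit": "celsius"}}

def embed_metadata(measurement, metadata_by_id):
    """Return a copy of the measurement with its referenced metadata embedded."""
    doc = {k: v for k, v in measurement.items() if k != "metadata-id"}
    # Deep copy so later edits to the embedded copy do not touch the source document.
    doc["metadata"] = copy.deepcopy(metadata_by_id[measurement["metadata-id"]])
    return doc

measurement = {"id": "measurement1", "metadata-id": "metadata1", "value": 21.5}
embedded = embed_metadata(measurement, metadata_by_id)
```

The cost noted above still applies: whenever a metadata document changes, every measurement that embeds a copy has to be rewritten too.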

DocumentDB data structure misunderstanding

I'm starting a new website project and I would like to use DocumentDB as the database instead of a traditional RDBMS.
I will need to store two kinds of documents:
User documents, which will hold all the user data.
Survey documents, which will hold all data about surveys.
May I put both kinds in a single collection, or should I create one collection for each?
How you do this is totally up to you - it's a fairly broad question, and there are good reasons for combining, and good reasons for separating. But objectively, you'll have some specific things to consider:
Each collection has its own cost footprint (starting around $24 per collection).
Each collection has its own performance (RU capacity) and storage limit.
Documents within a collection do not have to be homogeneous - each document can have whatever properties you want. You'll likely want some type of identification property that you can query on, to differentiate document types, should you store them all in a single collection.
Transactions are collection-scoped. So, for example, if you're building server-side stored procedures and need to modify content across your User and Survey documents, you need to keep this in mind.

PouchDB structure

I am new to the NoSQL concept, so when I started to learn PouchDB, I found this conversion chart. My confusion is how PouchDB handles things if, let's say, I have multiple tables: does it mean that I need to create multiple databases? From my understanding, a PouchDB database can store a lot of documents, but does a document correspond to a row in SQL, or have I misunderstood?
The answer to this question seems to be surprisingly under-documented. While @llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section When not to use map/reduce, Nolan explains that for simpler applications, the key is to abuse _ids, and leverage the power of allDocs().
In other words, if you had two separate types (say artists, and albums), then you could prefix the id of each type to obtain an easily searchable data set. For example _id: 'artist_name' & _id: 'album_title', would allow you to easily retrieve artists in name order.
Laying out the data this way will result in better performance due to not requiring extra indexes, and less code. Clearly however, if your data requirements are more complex, then views are the way to go.
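The prefix trick relies on ids being kept in sorted order, so one type's documents form a contiguous range. The Python sketch below simulates that range scan over a plain sorted list; it is not PouchDB code. With PouchDB itself the equivalent would be an allDocs() call with startkey and endkey options, and the `\ufff0` upper bound used here is the high sentinel character commonly used for such prefix ranges. The sample ids are made up.

```python
from bisect import bisect_left, bisect_right

# Sorted ids, standing in for the primary index of a single database.
ids = sorted([
    "album_Abbey Road",
    "album_Revolver",
    "artist_Bowie",
    "artist_Abba",
    "artist_Zappa",
])

def ids_with_prefix(sorted_ids, prefix):
    """Return all ids starting with prefix, using a range scan on the sorted list."""
    start = bisect_left(sorted_ids, prefix)
    end = bisect_right(sorted_ids, prefix + "\ufff0")  # high sentinel closes the range
    return sorted_ids[start:end]

artists = ids_with_prefix(ids, "artist_")
albums = ids_with_prefix(ids, "album_")
```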
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. The SQL table defines the column headers (name and type); these correspond to the JSON property names of the doc.
So, all docs (rows) with the same properties (a so-called "schema") are the equivalent of your SQL table. You can have as many different schemata in one database as you want (visit json-schema.org for some inspiration).
How do you request them separately? Create CouchDB views! You can get all or some "rows" of your tabular data (docs with the same schema) with one request, as you know it from SQL.
To make such views easy to write, a type property is very common in CouchDB docs. The name you know from a SQL table can be your type, like doc.type: "animal".
Your view names might be animalByName or animalByWeight, depending on your needs.
Sometimes a multiple-database plan is a good option, like a database per user or even a database per user-feature. Take a look at this conversation on the CouchDB mailing list.

What is the best way to store and search through object transactions?

We have a decent sized object-oriented application. Whenever an object in the app is changed, the object changes are saved back to the DB. However, this has become less than ideal.
Currently, transactions are stored as a transaction and a set of transactionLI's.
The transaction table has fields for who, what, when, why, foreignKey, and foreignTable. The first four are self-explanatory. ForeignKey and foreignTable are used to determine which object changed.
TransactionLI has timestamp, key, val, oldVal, and a transactionID. This is basically a key/value/oldValue storage system.
The problem is that these two tables are used for every object in the application, so they're pretty big tables now. Using them for anything is slow. Indexes only help so much.
So we're thinking about other ways to do something like this. Things we've considered so far:
- Sharding these tables by something like the timestamp.
- Denormalizing the two tables and merge them into one.
- A combination of the two above.
- Doing something along the lines of serializing each object after a change and storing it in subversion.
- Probably something else, but I can't think of it right now.
The whole problem is that we'd like to have some mechanism for properly storing and searching through transactional data. Yeah you can force feed that into a relational database, but really, it's transactional data and should be stored accordingly.
What is everyone else doing?
We have taken the following approach:-
All objects are serialised (using the standard XmlSerializer), but we have decorated our classes with serialisation attributes so that the resultant XML is much smaller (storing elements as attributes and dropping vowels from field names, for example). This could be taken a stage further by compressing the XML if necessary.
The object repository is accessed via a SQL view. The view fronts a number of tables that are identical in structure but have a GUID appended to the table name. A new table is generated when the previous table has reached critical mass (a pre-determined number of rows).
We run a nightly archiving routine that generates the new tables and modifies the views accordingly so that calling applications do not see any differences.
Finally, as part of the overnight routine we archive any old object instances that are no longer required to disk (and then tape).
I've never found a great end-all solution for this type of problem. One thing you can try, if your DB supports partitioning (and even if it doesn't, you can implement the same concept yourself), is to partition this log table by object type; you can then partition further by date/time or by your object ID (if your ID is numeric this works nicely; I'm not sure how a GUID would partition).
This will help keep the size of each table down and keep all transactions related to a single instance of an object together.
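If the database has no native partitioning, the "implement it yourself" variant amounts to routing each row to a table chosen from the object type and the date. A hypothetical Python sketch follows; the naming scheme and the monthly granularity are assumptions made for the example.

```python
from datetime import datetime

def partition_name(object_type, changed_at):
    """Pick the log table for a change: one table per object type per month."""
    return f"transactions_{object_type.lower()}_{changed_at:%Y_%m}"

table = partition_name("Customer", datetime(2009, 3, 14))
```

A query for one object's history then only needs to touch the tables for that object type and the months of interest.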
One idea you could explore: instead of storing each field in a name/value pair table, you could store the data as a blob (either text or binary). For example, serialize the object to XML and store it in a field.
The downside of this is that as your object changes, you have to consider how this affects all historical data. If you're using XML, there are easy ways to update the historical XML structures; if you're using binary there are ways too, but you have to be more conscious of the effort.
I've had awesome success storing a rather complex object model with tons of interrelations as a blob (the XML serializer in .NET didn't handle the relationships between the objects). I could very easily see myself storing the binary data. A huge downside of storing it as binary data is that to access it you have to take it out of the database; with XML, if you're using a modern database like MSSQL, you can access the data inside the database.
One last approach is to split the two patterns: you could define a Difference Schema (I assume more than one property changes at a time). For example, imagine storing this XML:
<objectDiff>
    <field name="firstName" newValue="Josh" oldValue="joshua"/>
    <field name="lastName" newValue="Box" oldValue="boxer"/>
</objectDiff>
This will help reduce the number of rows, and if you're using MSSQL you can define an XML Schema and get some of the rich querying ability around the object. You can still partition the table.
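Producing such a diff document is straightforward. The Python sketch below compares two snapshots of an object, represented here as plain dictionaries, and emits one field element per changed property. The element and attribute names follow the XML above; the `object_diff` helper and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

def object_diff(old, new):
    """Build an <objectDiff> document listing only the fields whose value changed."""
    root = ET.Element("objectDiff")
    for name in sorted(new):
        if old.get(name) != new[name]:
            ET.SubElement(
                root,
                "field",
                name=name,
                newValue=str(new[name]),
                oldValue=str(old.get(name, "")),
            )
    return ET.tostring(root, encoding="unicode")

old = {"firstName": "joshua", "lastName": "boxer", "age": "30"}
new = {"firstName": "Josh", "lastName": "Box", "age": "30"}
diff_xml = object_diff(old, new)
```

One row holding this document replaces one row per changed field in the key/value table.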
Josh
Depending on the characteristics of your specific application an alternative approach is to keep revisions of the entities themselves in their respective tables, together with the who, what, why and when per revision. The who, what and when can still be foreign keys.
I would be very careful with this approach, though, since it is only viable for applications with a relatively small number of changes per entity/entity type.
If querying the data is important I would use true Partitioning in SQL Server 2005 and above if you have enterprise edition of SQL Server. We have millions of rows partitioned by year down to day for the current month - you can be as granular as your application demands with a maximum number of 1000 partitions.
Alternatively, if you are using SQL Server 2008, you could look into filtered indexes.
These are solutions that will enable you to retain the simplified structure you have whilst providing the performance you need to query that data.
Splitting/Archiving older changes obviously should be considered.

Resources