I am very new to Cosmos DB (DocumentDB). While going through the documentation, I keep reading that DocumentDB is schema-free, but I feel that a collection is analogous to a schema and that both are logical views.
Wikipedia defines schema as 'The term "schema" refers to the organization of data as a blueprint of how the database is constructed'. I believe a collection is the same: it is the organization of documents, stored procedures, triggers and UDFs.
So my question is: how is a schema different from a collection?
Collections really have nothing to do with schema. They are just an organizational construct for documents. With Cosmos DB, they serve as:
a transaction boundary. Within a collection, you can perform multiple queries / updates within a transaction, utilizing stored procedures. These updates are constrained to a single collection (more specifically, to a single partition within a collection).
a billing/performance boundary. Cosmos DB lets you specify the number of Request Units (RU) / second to allocate to a collection. Every collection can have a different RU setting. Every collection has a minimum cost (due to minimum amount of RU that must be allocated), regardless of how much storage you consume.
a server-side code boundary. Stored procedures, triggers, etc. are uploaded to a specific collection.
Whether you choose to create a single collection per object type, or store multiple object types within a single collection, is entirely up to you, and is unrelated to the shape of your data.
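To illustrate that last point, here is a minimal sketch of one collection holding two object types. The `type` property and the sample documents are assumptions made for illustration, and a plain array stands in for the collection rather than a real Cosmos DB client.

```javascript
// A plain array stands in for a single collection (assumption for illustration).
const collection = [
  { id: "1", type: "customer", name: "John" },
  { id: "2", type: "order", customerId: "1", total: 42 },
  { id: "3", type: "customer", name: "Victor" },
];

// Select documents of one object type using the hypothetical "type" discriminator.
function byType(docs, type) {
  return docs.filter((doc) => doc.type === type);
}

const customers = byType(collection, "customer");
console.log(customers.map((c) => c.name));
```

The collection itself imposes no shape here; the application decides how to tell object types apart.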
The schema of a relational database differs from the schema of a document database. In simple terms, a relational schema is stricter than a document schema: records in an RDBMS table must strictly adhere to the table's schema, whereas we have some flexibility when storing a document in a document collection.
Conventionally, a collection is a set of documents that follow the same schema. But document DBs don't stop you from storing documents with different schemas in a single collection. That is the flexibility they give their users.
Let us take an example. Let us assume we are storing some customer information.
In a relational DB, we might have a structure like:
Customer ID INT
Name VARCHAR(50)
Phone VARCHAR(15)
Email VARCHAR(255)
Depending on whether a customer has an email or phone number, each of those columns holds either a proper value or null.
ID  Name    Phone     Email
1   John    83453452  -
2   Victor  -         -
3   Smith   34535345  smith#jjjj
In a document database, however, a field does not need to appear in a document at all if it has no value.
[
  {
    "id": "123",
    "name": "John",
    "phone": "2572525"
  },
  {
    "id": "456",
    "name": "Stephen"
  },
  {
    "id": "789",
    "name": "King",
    "phone": "2572525",
    "email": "king#asfaf"
  }
]
However, for maintainability it is advisable to stick to a schema in document DBs, even though they give you the flexibility to store schema-less documents in a collection.
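One way to stick to a schema when the database does not enforce one is to check documents in the application before writing them. The following is only a sketch: the `customerSchema` object and the `validate` helper are hypothetical names, not part of any database SDK.

```javascript
// Hypothetical application-level schema: required and optional field names.
const customerSchema = {
  required: ["id", "name"],
  optional: ["phone", "email"],
};

// Returns a list of problems; an empty list means the document conforms.
function validate(doc, schema) {
  const problems = [];
  for (const field of schema.required) {
    if (!(field in doc)) problems.push(`missing required field: ${field}`);
  }
  const known = new Set([...schema.required, ...schema.optional]);
  for (const field of Object.keys(doc)) {
    if (!known.has(field)) problems.push(`unexpected field: ${field}`);
  }
  return problems;
}

// Optional fields may be absent; required ones may not.
console.log(validate({ id: "456", name: "Stephen" }, customerSchema));
console.log(validate({ id: "999", nickname: "Al" }, customerSchema));
```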
For those of you who have worked with Firebase (Firestore): you can have a collection of documents in which each document has an id, and a document can hold a sub collection (the equivalent of an embedded array of documents stored as a property).
This sub collection can in turn hold many documents, each with its own id.
In Firestore the sub collection is lazy loaded.
A query fetches the documents in a collection, but if a document has one or more sub collections, they are not retrieved unless you specifically follow that route, e.g. collection/document/subcollectionname/anotherdocument
So 2 questions:
Are embedded documents lazy loaded? I don't want to get a document with all of its embedded documents (possibly a million) unless I explicitly access them.
How can I make sure each embedded document in MongoDB gets an "_id" in the form of ObjectID("blablabla")?
EDIT:
I currently have a firestore implementation which has a subcollection/s practice behind it.
Example: organization => documentId => projects => projectId => activities => :activityType => activityId
organization collection that holds documents (each document = organization).
Each organization document holds a schema (id, name, language, etc..) and a few subcollections in which one of them is projects subcollection
projects sub collection holds documents of projects.
a project document holds the project schema (id, name, location, etc..) and one subcollection named activities.
activities sub collection holds its own schema (id, type, category, etc...) and 6 more sub collections, each represents an activity type.
each activity sub collection holds its own schema. No more sub collections.
Now, the good thing about it is that if I choose to get all organizations, then I will get only the documents of the organization collection and NOT the embedded subcollections (projects, etc..) while in MongoDB, I would get EVERYTHING per document.
How do I achieve the same nested documents with their own nested documents structure with the lazy load effect in MongoDB?
How many activities can a single project have? If there's no limit, then you're better off creating a root-level collection for activities. In MongoDB, the maximum BSON document size is 16 MB. That means you may not be able to store all projects and their activities in a single document (the organization document).
I would create 3 collections namely - organizations, projects and activities.
Each organization should have a document in organizations collection similar to that you have in Firestore.
Each project should have a document in the projects collection containing a field "organizationID", so you can query the projects of a specific organization using its ID. This is the equivalent of a document in your projects sub-collection. Every project must also have its own unique ID.
Each activity should have a document in activities collection containing a field "projectID" so activities of a specific project can be retrieved.
I've added those additional organizationID, projectID fields even though you have _id just in case you'd like to have Firestore Document IDs there for easier side-by-side queries.
You don't have to worry about 16 MB document size limit this way and it'll be easier to query both projects and activities as long as you have the correct IDs.
Querying activities of a certain project:
await db.collection("activities").find({projectID: "myProjectID"}).toArray()
Thereafter it's up to you how you want to write queries with projections, aggregation, etc.
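The "lazy load" effect of this layout can be sketched without a database. Plain arrays stand in for the three root-level collections, and the sample data and the helper names `projectsOf` and `activitiesOf` are made up for illustration. With the real driver, each helper would be a `find` on the reference field, like the query above.

```javascript
// Plain arrays stand in for the three root-level collections (sample data is made up).
const organizations = [{ _id: "org1", name: "Acme" }];
const projects = [
  { _id: "p1", organizationID: "org1", name: "Bridge" },
  { _id: "p2", organizationID: "org1", name: "Tunnel" },
];
const activities = [
  { _id: "a1", projectID: "p1", type: "survey" },
  { _id: "a2", projectID: "p2", type: "survey" },
  { _id: "a3", projectID: "p1", type: "inspection" },
];

// Children are fetched only when a caller asks for them by parent ID.
const projectsOf = (organizationID) =>
  projects.filter((p) => p.organizationID === organizationID);
const activitiesOf = (projectID) =>
  activities.filter((a) => a.projectID === projectID);

// Listing organizations touches no project or activity data.
console.log(organizations);
console.log(projectsOf("org1").map((p) => p.name));
console.log(activitiesOf("p1").map((a) => a._id));
```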
I am trying to implement relations on collections. My requirement is
Post request 1, json body:
{
"username":"aaa",
"password":"bbb",
"role":"owner",
"company":"SAS"
}
Post request 2, creating from first document so I got company name from previous json body:
{
"username":"eee",
"password":"fff",
"role":"engineer",
"company":"SAS"
}
Post request 3, creating from first document so I got company name from previous json body:
{
"username":"uuu",
"password":"kkk",
"role":"engineer",
"company":"SAS"
}
Post request 4, next company json body:
{
"username":"hhh",
"password":"ggg",
"role":"owner",
"company":"GVG"
}
Here company is a foreign key field. How can I store the company with an id field, without one insert failing independently of the other, as transactions would guarantee?
In MySQL I would create two tables, company and user, and insert into both within a single transaction per POST, linking them by id. If the company name is updated, the id stays the same for the owner and the engineers.
How can I achieve this in MongoDB with Node.js?
In online searches, most answers suggest avoiding transactions and using MongoDB features such as embedded documents instead.
I would suggest you start by making schemas for user and company using mongoose. It's an ODM (object document mapper) that is very commonly used with Node.js and MongoDB.
Now, this is a one-to-many relation. In a relational database, as you have mentioned, you would make a company table and a user table.
In MongoDB it "depends". If it is a one-to-"few" relationship, you would just nest a users array in the company document. Since you are then only updating a single document (pushing a user onto the users array in the company document), you won't need any transactions. A single-document update is always atomic, no matter how many fields you update on that document.
But if each company can have a large number of users (an ever-growing nested array is not good, as it can cause data fragmentation and bad performance), then it is better to store the company's id in each user document. Even in this case you would not need a transaction, since you are not updating the company document.
Another reason for storing users in a separate collection is querying. If you just want to query users, it is difficult when they are nested in companies. So you need to consider how you will query and the cardinality of the relationship, then decide whether to nest or to store in separate collections.
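The two layouts can be compared side by side in a small sketch. Plain objects and arrays stand in for MongoDB documents and collections, and the helper names `addUserEmbedded` and `addUserReferenced` are hypothetical. The point is only that each operation writes exactly one document.

```javascript
// Option 1: one-to-few, users embedded in the company document.
const companyEmbedded = { _id: "c1", name: "SAS", users: [] };

function addUserEmbedded(company, user) {
  // Only one document changes. In MongoDB this would be a single-document
  // update such as { $push: { users: user } }.
  company.users.push(user);
}

// Option 2: one-to-many, users in their own collection with a company reference.
const companies = [{ _id: "c1", name: "SAS" }];
const users = [];

function addUserReferenced(userCollection, companyId, user) {
  // Only a new user document is written; the company document is untouched.
  userCollection.push({ ...user, companyId });
}

addUserEmbedded(companyEmbedded, { username: "aaa", role: "owner" });
addUserReferenced(users, "c1", { username: "eee", role: "engineer" });

// Renaming the company touches one document; references by id stay valid.
companies[0].name = "SAS Ltd";
console.log(companyEmbedded.users.length, users[0].companyId);
```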
First of all, you should note that Mongo is a document-oriented DB, not a relational one. So if you need transactions and a relational model, perhaps you should use a relational (SQL) database, especially if you are more familiar with those?
About relations and data modeling: you should read this article (or even the entire section) in the official Mongo docs, Data Modelling.
TL;DR: you could create two separate collections (the equivalent of tables in SQL), such as employees and companies (by default, a collection's name is the plural form), and store the data separately.
So your employees will be stored as you mention above, but companies will look like:
{
_id: ObjectID("35473645632"),
name: "SAS"
}, ...
As for your employees collection, you should store not "company": "SAS" but "company": ObjectID("35473645632"), or even an array of such ids if you want. But don't forget to edit your schema then.
You don't have to use MongoDB's default _id; you could use your own, which can be any unique number/string combination.
So if your company is renamed, its connection with other documents (employees) will still be there.
To request all or any of your employees together with their company names, you should use the aggregation framework with $lookup, instead of .find.
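As a rough sketch of that last step, a `$lookup` stage can be written as plain data. The field name `companyDocs` is a hypothetical choice, and the in-memory `lookup` function below only imitates what the stage does so the example runs without a database. With the Node.js driver, the pipeline would be passed to something like `db.collection("employees").aggregate(pipeline)`.

```javascript
// Aggregation pipeline as plain data (not executed here).
const pipeline = [
  {
    $lookup: {
      from: "companies",     // collection to join with
      localField: "company", // field in employees holding the company _id
      foreignField: "_id",   // field in companies to match against
      as: "companyDocs",     // hypothetical name for the joined array
    },
  },
];

// In-memory imitation of that stage, for illustration only.
function lookup(employees, companies) {
  return employees.map((e) => ({
    ...e,
    companyDocs: companies.filter((c) => c._id === e.company),
  }));
}

// Plain strings stand in for ObjectId values.
const companyList = [{ _id: "c1", name: "SAS" }];
const employeeList = [{ username: "eee", role: "engineer", company: "c1" }];

const joined = lookup(employeeList, companyList);
console.log(joined[0].companyDocs[0].name);
```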
I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straight forward, as described in Modeling document data. However, I also would like to separate the related documents into different collections (this decision is related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: the reasoning mainly comes down to partition keys and read/write priorities. Since the partition key is required to be present in ALL documents in a collection, it makes sense to separate out documents to which the chosen partition key doesn't belong. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards, but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one-to-gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata-id" : "../Metadata/metadata1",
...
}
Then, when I parse the data in my application/script I know what collection and document to query.
Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores, not slashes; use a symbol to represent a collection, like $Metadata; etc). Or, is my use of relations spanning collections a code smell?
Thank you!
Edit: To the downvoter, can you please explain your reasoning? Is my question uninformed, unclear, or not useful? Why?
You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed at the per-collection level. What you should be doing is picking a more generic partition key, something like key or partitionKey. The tradeoff here is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use the value of whatever you chose originally for your Measurements documents and set something different for your Metadata documents.
I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous in documentdb. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos which necessitate having many different types of entities in a single collection and supports exactly the use case you're describing minus the extra collections. Obviously when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key which is why you need to go generic.
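A minimal sketch of populating such a generic property on the client side might look like this. The `type` discriminator, the `keySelectors` table and the `withPartitionKey` helper are all assumptions for illustration. Only the idea of a generic `partitionKey` property comes from the answer above.

```javascript
// Hypothetical rule: each document type derives the generic "partitionKey"
// value from whichever of its own properties suits it.
const keySelectors = {
  measurement: (doc) => doc.deviceId,
  metadata: (doc) => doc.id,
};

// Returns a copy of the document with the generic partition key populated.
function withPartitionKey(doc) {
  const select = keySelectors[doc.type];
  if (!select) throw new Error(`no partition key rule for type: ${doc.type}`);
  return { ...doc, partitionKey: String(select(doc)) };
}

const measurementDoc = withPartitionKey({
  id: "measurement1",
  type: "measurement",
  deviceId: "device42",
});
const metadataDoc = withPartitionKey({ id: "metadata1", type: "metadata" });

console.log(measurementDoc.partitionKey, metadataDoc.partitionKey);
```

Both document types can then live in one collection partitioned on `partitionKey`, at the cost of a duplicated value in each document.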
What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind though, that using this type of "relations" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
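Figuring out the reference could be as simple as the following sketch, which parses the "../Collection/id" format from the question and then does the second fetch. The `parseReference` and `resolve` helpers, the `store` object and the `unit` field are hypothetical. Plain arrays stand in for the collections, and each lookup would be a separate query against the database.

```javascript
// Parses a reference of the form "../<collection>/<id>" (the format from the question).
function parseReference(ref) {
  const match = /^\.\.\/([^/]+)\/([^/]+)$/.exec(ref);
  if (!match) throw new Error(`unrecognized reference: ${ref}`);
  return { collection: match[1], id: match[2] };
}

// Plain arrays stand in for the collections (sample data is made up).
const store = {
  Metadata: [{ id: "metadata1", unit: "celsius" }],
  Measurements: [{ id: "measurement1", "metadata-id": "../Metadata/metadata1" }],
};

// Second round trip: fetch the referenced document from the other collection.
function resolve(ref) {
  const { collection, id } = parseReference(ref);
  return (store[collection] || []).find((doc) => doc.id === id);
}

const firstMeasurement = store.Measurements[0];
console.log(parseReference(firstMeasurement["metadata-id"]));
console.log(resolve(firstMeasurement["metadata-id"]).unit);
```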
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so if you update those documents, you will have to update them in both collections, but you'll probably write less often than you read (probably, if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata" : {
"id": "metadata1",
...
},
...
}
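The update cost of this duplication can be sketched as follows. Plain arrays stand in for the two collections, the `sensor` field is made up, and `updateMetadata` is a hypothetical helper that counts how many documents would have to be written.

```javascript
// Plain arrays stand in for the two collections (sample data is made up).
const metadataDocs = [{ id: "metadata1", sensor: "A" }];
const measurements = [
  { id: "measurement1", metadata: { id: "metadata1", sensor: "A" } },
  { id: "measurement2", metadata: { id: "metadata1", sensor: "A" } },
];

// Updating metadata means writing the source document and every embedded copy.
function updateMetadata(id, changes) {
  let writes = 0;
  for (const doc of metadataDocs) {
    if (doc.id === id) {
      Object.assign(doc, changes);
      writes++;
    }
  }
  for (const m of measurements) {
    if (m.metadata.id === id) {
      Object.assign(m.metadata, changes);
      writes++;
    }
  }
  return writes;
}

// One source document plus two embedded copies are written.
const writeCount = updateMetadata("metadata1", { sensor: "B" });
console.log(writeCount);
```

Reads, on the other hand, need only the measurement document, which is why this layout favours read-heavy workloads.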
We are planning to move from MySql to Cloudant NoSql. I want to understand what would be the best approach to do that.
We have 5 different tables--Product (ProductId Primary key), Issues (IssueId primary key, ProductId Foreign key) and Tags (Tag id Primary key, ProductId Foreign key) and Location (LocationId primary key location as foreign key with location in product table) and Policy (policyId primary key, IssueId as primary key).
We have thought of two approaches for maintaining documents in Cloudant.
Keep a separate document for each row, with a unique document type per table (one document type for each table, e.g. document types "product", "issues", "tag", "location", "policy").
Keep a separate document for each product row, with all relations defined in that one document (all documents have type "product" only, each holding all tags, issues [policies] and locations for that product).
Which approach is better?
The answer really depends on the size and rate of growth of your data. In previous SQL->NoSQL migrations, I've used your second approach (I don't know your exact schema, so I'll guess):
{
_id: "prod1",
name: "My product",
tags: [
"red", "sport", "new"
],
locations: [
{
location_id: "55",
name: "London",
latitude: 51.3,
longitude: 0.1
}
],
issues: [
{
issue_id: "466",
policy_id: "88",
name: "issue name"
}
]
}
This approach allows you to get almost everything about a product in a single Cloudant API call (GET /products/prod1). Such a call will give you all of your primary product data and what would have been joins in a SQL world - in this case arrays of things or arrays of objects.
You may still want another database of locations or policies because you may want to store extra information about those objects in separate collections, but you can store a sub-set of that data (e.g. the location's name and geo position) in the product document. This does mean duplicating some data from your reference "locations" collection inside each product, but leads to greater efficiency at query time (at the expense of making updates to the data more complicated).
It all depends on how you access the data. For speed and efficiency you want to be able to retrieve the data you need to render a page in as few API calls as possible. If you keep everything in its own database, then you need to do the joins yourself, because Cloudant doesn't have joins. This would be inefficient because you would need an extra API call for each "join".
There is another way to manage "joins" in Cloudant, and this may be suitable if your secondary collections are large, e.g. if the number of locations/tags/issues would make the product document size too large.
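The cost of doing the joins yourself can be illustrated with a small sketch in which every `get` counts as one API call. Plain objects stand in for separate databases, and the `location_ids` field and the `get` helper are hypothetical, not part of the Cloudant API.

```javascript
// Plain objects stand in for separate databases; every get() counts as one API call.
let apiCalls = 0;
const databases = {
  products: { prod1: { _id: "prod1", name: "My product", location_ids: ["55"] } },
  locations: { "55": { _id: "55", name: "London" } },
};

function get(db, id) {
  apiCalls++;
  return databases[db][id];
}

// One call for the product, plus one extra call per referenced location.
const product = get("products", "prod1");
const productLocations = product.location_ids.map((id) => get("locations", id));

console.log(productLocations.map((l) => l.name), apiCalls);
```

With the location's name embedded in the product document, the same page could be rendered from the first call alone.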
I have a Location table as below. I need to fetch the details of only those drivers whose ratings are above three for a particular location.
Thanks in advance...
[
{
"name":"Delhi",
"cab_details(sub table)":[
{
"driver_details"(join):{
"name":"1111",
"ratings_above_three":true
},
"date_joining": date
},
{
"driver_details":{
"name":"2222",
"ratings_above_three":false
},
"date_joining": date
}
]
}
]
It would be much easier if you put your drivers into a separate collection.
MongoDB does not handle growing documents very well.
In your schema design, follow the principle of least cardinality.
Refer to this question: mongodb-performance-with-growing-data-structure
I suspect the model you presented is a City from a cities collection. So I would recommend creating a separate collection called cab_drivers and populating it with CabDriver documents. You can maintain the relationship between them by using ObjectIds. Store an array of CabDriver._id values in your City as cab_drivers; it will be much easier to query, validate and update them. Also, the cab_drivers array inside a City document will grow much more slowly than it would if you kept the whole documents within the parent.
If you use mongoose, you could then use city.populate('cab_drivers') to load all related documents at the application level.
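With that layout, the original question (drivers rated above three for one location) becomes a filter on the drivers collection. The sketch below uses plain arrays and string ids in place of real collections and ObjectIds, and `topDriversIn` is a hypothetical helper. Against MongoDB, the filter would correspond in spirit to a query such as `{ _id: { $in: city.cab_drivers }, ratings_above_three: true }`.

```javascript
// Plain arrays stand in for the cab_drivers and cities collections.
const cabDrivers = [
  { _id: "d1", name: "1111", ratings_above_three: true },
  { _id: "d2", name: "2222", ratings_above_three: false },
];
const cities = [{ _id: "c1", name: "Delhi", cab_drivers: ["d1", "d2"] }];

// Returns the drivers referenced by the city whose ratings are above three.
function topDriversIn(cityName) {
  const city = cities.find((c) => c.name === cityName);
  if (!city) return [];
  return cabDrivers.filter(
    (d) => city.cab_drivers.includes(d._id) && d.ratings_above_three
  );
}

console.log(topDriversIn("Delhi").map((d) => d.name));
```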