Migrating MySQL tables to Cloudant documents - CouchDB

We are planning to move from MySQL to Cloudant NoSQL, and I want to understand the best approach for doing that.
We have five different tables: Product (ProductId primary key), Issues (IssueId primary key, ProductId foreign key), Tags (TagId primary key, ProductId foreign key), Location (LocationId primary key, referenced by a location foreign key in the Product table) and Policy (PolicyId primary key, IssueId foreign key).
We have thought of two approaches for maintaining documents in Cloudant:
1. Keep a separate document for each row, with a unique document type per table (one document type for each table, e.g. document types "product", "issues", "tag", "location", "policy").
2. Keep a separate document for each row, with all relations defined in one document (all documents of type "product" only, each maintaining its tags, issues [with their policies] and location).
Which approach is better?

The answer really depends on the size and rate of growth of your data. In previous SQL->NoSQL migrations, I've used your second approach (I don't know your exact schema, so I'll guess):
{
  "_id": "prod1",
  "name": "My product",
  "tags": ["red", "sport", "new"],
  "locations": [
    {
      "location_id": "55",
      "name": "London",
      "latitude": 51.3,
      "longitude": 0.1
    }
  ],
  "issues": [
    {
      "issue_id": "466",
      "policy_id": "88",
      "name": "issue name"
    }
  ]
}
This approach allows you to get almost everything about a product in a single Cloudant API call (GET /products/prod1). Such a call gives you all of your primary product data plus what would have been joins in the SQL world - in this case, arrays of values or arrays of objects.
You may still want separate databases of locations or policies, because you may want to store extra information about those objects in separate collections, but you can store a subset of that data (e.g. the location's name and geo position) in the product document. This does mean duplicating some data from your reference "locations" collection inside each product, but it leads to greater efficiency at query time (at the expense of making updates to the data more complicated).
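The embedding itself is just a fold of child-table rows into the parent document. A minimal Python sketch of that migration step (table and column names here are assumptions, not your actual schema):

```python
# Hypothetical sketch: composing an embedded Cloudant "product" document
# from relational-style rows fetched out of MySQL. The dict shapes below
# stand in for rows of the Product, Tags, Location and Issues tables.

def build_product_doc(product, tags, locations, issues):
    """Fold child-table rows into a single embedded product document."""
    return {
        "_id": f"product:{product['product_id']}",
        "type": "product",
        "name": product["name"],
        # tags collapse to a plain array of values
        "tags": [t["name"] for t in tags],
        # locations and issues become arrays of objects (the former joins)
        "locations": [
            {"location_id": l["location_id"], "name": l["name"]}
            for l in locations
        ],
        "issues": [
            {"issue_id": i["issue_id"], "policy_id": i["policy_id"], "name": i["name"]}
            for i in issues
        ],
    }

doc = build_product_doc(
    {"product_id": "prod1", "name": "My product"},
    [{"name": "red"}, {"name": "sport"}],
    [{"location_id": "55", "name": "London"}],
    [{"issue_id": "466", "policy_id": "88", "name": "issue name"}],
)
```

The resulting document is what you would PUT into Cloudant once per product; everything a page needs then comes back in one GET.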
It all depends on how you access the data. For speed and efficiency, you want to be able to retrieve the data you need to render a page in as few API calls as possible. If you keep everything in its own database, then you have to do the joins yourself, because Cloudant doesn't have joins. This would be inefficient, because you would need an extra API call for each "join".
There is another way to manage "joins" in Cloudant, and it may be suitable if your secondary collections are large, e.g. if the number of locations/tags/issues would make the product document too large.
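That alternative is usually done with view collation: keep each row as its own document carrying the parent's id, have a view emit a [product_id, type] key, and read one key range to fetch a product together with all of its children in a single query. A rough Python simulation of that collation (document shapes are assumptions):

```python
# Simulation of CouchDB/Cloudant view collation "joins": every row is its
# own document, all sharing a product_id field, and a single range query
# on a [product_id, type] view key returns a product plus its children.

docs = [
    {"_id": "prod1", "type": "product", "product_id": "prod1", "name": "My product"},
    {"_id": "tag1", "type": "tag", "product_id": "prod1", "name": "red"},
    {"_id": "issue466", "type": "issue", "product_id": "prod1", "name": "issue name"},
    {"_id": "prod2", "type": "product", "product_id": "prod2", "name": "Other"},
]

def view_rows(all_docs):
    """Simulate a map function emitting [product_id, type] keys, collated."""
    rows = [([d["product_id"], d["type"]], d) for d in all_docs]
    return sorted(rows, key=lambda r: r[0])

def fetch_product_with_children(all_docs, product_id):
    """Simulate a startkey/endkey range query over the view."""
    return [d for key, d in view_rows(all_docs) if key[0] == product_id]
```

This keeps each document small at the cost of a view query instead of a primary-key GET; no data is duplicated, so updates stay simple.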

Related

What is the correct way to use Cosmos DB with documents bigger than 2 MB?

I am using Azure Cosmos DB as the database for my application.
Let's say that I need to save all of the countries, cities and streets in my database. So I would have an item that looked like this:
{
  "country": "Brazil",
  "size": 1000,
  "population": 200000,
  "cities": [
    {
      "city": "Rio",
      "population": 8000,
      "streets": [
        { "name": "A", "postalCode": 12345 },
        { "name": "B", "postalCode": 34567 }
      ]
    },
    ...
  ]
}
However, since I am talking about all of the countries, cities and streets, this becomes a huge item - bigger than the 2 MB allowed by Cosmos DB.
So what is the correct way to deal with this? Should I separate the cities and streets into different collections? Using different collections has drawbacks, though, since it is then not possible to use a stored procedure or guarantee a transaction when updating two different collections.
You can put these into the same collection; just use primary keys to separate them logically (this is not technically needed, it's just better). With your data set it probably makes sense to partition on city (or, less likely, country). You don't have to have identical documents in the same collection, although they would look pretty much the same.
Can you explain why you need everything in a single giant document? And why do you need transactions for updates of this data?
A better approach is to use multiple individual documents, all stored in a single collection for easier management.
Use a field in each document to describe what level it's for (country, city, zip) and then store all the necessary information in that document for that level. You can probably use the country as the partition key as it will likely fit within the 10GB/partition limit.
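One way to sketch that split (field names are assumptions based on the example above): each level becomes its own small document, all sharing the country as partition key, so related documents stay in one logical partition and can still be updated together via a stored procedure.

```python
# Illustrative sketch: splitting the oversized country document into one
# document per level (country / city / street). Every document carries
# the same partitionKey (the country name) plus a "level" discriminator.

def split_country(country_doc):
    docs = [{
        "id": country_doc["country"],
        "partitionKey": country_doc["country"],
        "level": "country",
        "population": country_doc["population"],
    }]
    for city in country_doc["cities"]:
        docs.append({
            "id": f"{country_doc['country']}/{city['city']}",
            "partitionKey": country_doc["country"],
            "level": "city",
            "population": city["population"],
        })
        for street in city["streets"]:
            docs.append({
                "id": f"{country_doc['country']}/{city['city']}/{street['name']}",
                "partitionKey": country_doc["country"],
                "level": "street",
                "postalCode": street["postalCode"],
            })
    return docs

brazil = {
    "country": "Brazil", "size": 1000, "population": 200000,
    "cities": [{
        "city": "Rio", "population": 8000,
        "streets": [{"name": "A", "postalCode": 12345},
                    {"name": "B", "postalCode": 34567}],
    }],
}
docs = split_country(brazil)
```

Each resulting document stays far below the item size limit, and a query scoped to one partitionKey value still retrieves a whole country's hierarchy cheaply.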

homogeneous vs heterogeneous in documentdb

I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should have all three of those things stored in a single collection, with a field to describe what the data type is; that each collection should be related by date or geographic area, so one part of the world has a smaller portion to search; and to:
"Combine different types of documents into a single collection and add
a field across all to separate them in searching like a type field or
something"
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (for example, the Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. In order to leverage Cosmos DB to its full potential, you need to think of a collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you select a generic name such as key or partitionKey, you can easily separate the storage of your inbound emails from your users, or from anything else, by picking appropriate values.
class InboundEmail
{
    public string Key { get; set; } = "EmailsPartition";
    // other properties
}

class User
{
    public string Key { get; set; } = "UsersPartition";
    // other properties
}
What I'm showing is still only an example, though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick; as soon as you need to scan across multiple partitions, you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
If you want evidence that this is the appropriate way to use Cosmos DB, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities all in different collections, none of the Graph APIs or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." Cosmos DB does automatically index every field of every document. It uses a special proprietary path-based indexing mechanism that ensures every path in your JSON tree has indices on it. You have to specifically opt out of this auto-indexing feature.

How is a collection different from a schema?

I am very new to Cosmos DB (DocumentDB). While going through the documentation, I keep reading that DocumentDB is schema-free, but I feel like a collection is analogous to a schema: both are logical views.
Wikipedia defines a schema as "the organization of data as a blueprint of how the database is constructed". I believe a collection is the same thing: it is the organization of documents, stored procedures, triggers and UDFs.
So my question is: how is a schema different from a collection?
Collections really have nothing to do with schema. They are just an organizational construct for documents. With Cosmos DB, they serve as:
a transaction boundary. Within a collection, you can perform multiple queries / updates within a transaction, utilizing stored procedures. These updates are constrained to a single collection (more specifically, to a single partition within a collection).
a billing/performance boundary. Cosmos DB lets you specify the number of Request Units (RU) / second to allocate to a collection. Every collection can have a different RU setting. Every collection has a minimum cost (due to minimum amount of RU that must be allocated), regardless of how much storage you consume.
a server-side code boundary. Stored procedures, triggers, etc. are uploaded to a specific collection.
Whether you choose to create a single collection per object type, or store multiple object types within a single collection, is entirely up to you - and unrelated to the shape of your data.
The schema concept in relational databases is slightly different from that in document databases. In simple terms, a relational schema is stricter than a document schema: records in an RDBMS table must strictly adhere to the schema, whereas we have some amount of flexibility when storing a document in a document collection.
Conventionally, a collection is a set of documents that follow the same schema, but document DBs don't stop you from storing documents with different schemas in a single collection. That is the flexibility they give to users.
Let us take an example. Let us assume we are storing some customer information.
In relational DB, we might have some structure like
Customer ID INT
Name VARCHAR(50)
Phone VARCHAR(15)
Email VARCHAR(255)
Depending on whether a customer has an email or phone number, those will be recorded as proper values or as null values.
ID, Name, Phone,    Email
1,  John,   83453452, -
2,  Victor, -,        -
3,  Smith,  34535345, smith@jjjj
In document databases, however, fields that don't have any values need not appear in the document at all.
[
  {
    "id": "123",
    "name": "John",
    "phone": "2572525"
  },
  {
    "id": "456",
    "name": "Stephen"
  },
  {
    "id": "789",
    "name": "King",
    "phone": "2572525",
    "email": "king@asfaf"
  }
]
However, even though document DBs give you the flexibility to store schema-less documents in a collection, it is always advisable to stick to a schema, for maintainability.
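That advice can be enforced at the application layer. A hypothetical validator for the customer documents above (field names assumed from the example):

```python
# Sketch of lightweight application-level schema enforcement: the database
# won't reject off-schema documents, so check them before writing.

REQUIRED = {"id", "name"}          # fields every customer document must have
OPTIONAL = {"phone", "email"}      # fields that may be omitted entirely

def validate_customer(doc):
    """Return True if doc has all required fields and no unknown ones."""
    missing = REQUIRED - doc.keys()
    unknown = doc.keys() - REQUIRED - OPTIONAL
    return not missing and not unknown
```

A real application might use a library such as jsonschema instead, but even a check this small prevents a collection from silently accumulating incompatible document shapes.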

How to create a composite partition key in DocumentDB

This is how my document looks. I wanted to choose /department/city as the partition key, which involves two different attributes: one from the employee and another from address (an object embedded in the employee). I tried giving /department/address as the partition key, but it isn't listing the partition key in Data Explorer; I'm assuming it considers city to be an attribute of department.
{
  "eid": "",
  "entryType": "",
  "address": {
    "PIN": "",
    "city": "",
    "street": ""
  },
  "name": "",
  "id": "",
  "department": "",
  "age": ""
}
Can you please help me understand what I'm doing wrong and how to design a composite partition key and distribute/store/arrange employees data based on their department and city.
If I am not mistaken, composite partition keys are not currently supported: in a collection, you must define the partition key using just one attribute.
However, if you look at the REST API, the partition key is defined as an array (albeit an array containing only one element). This suggests that Azure may support composite partition keys in the future.
So, for now, pick one attribute (either department or city) to partition the data on, and define an index on the other attribute for faster searching.
In my Cosmos DB multi-partitioned collections, I generally specify that the partition key should be generic and just use a property literally called "partitionKey". The advantage of this is that you have much more fine-grained control over how your data is stored, by simply specifying a value that makes sense for the particular POCO you are inserting.
There's a high likelihood that somewhere down the line you're going to want to insert a document into this collection that doesn't conform to the structure of the document you showed us here. At that point you'll either have to split your docs across multiple collections or redesign your whole partition key strategy.
The trade-off is that at query time you have to have some knowledge of how these values are applied, but this can easily be enforced in code through a lightweight ORM layer.
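A common workaround for the composite-key limitation is a synthetic key: concatenate the two attributes into a single partition-key property at write time. A sketch (the property name, separator, and sample values are assumptions):

```python
# Sketch of a synthetic composite partition key: Cosmos DB accepts only a
# single partition-key path, so derive one "partitionKey" value from both
# department and city when the document is written.

def with_partition_key(employee):
    """Return a copy of the employee doc with a derived partitionKey."""
    doc = dict(employee)
    doc["partitionKey"] = f"{employee['department']}|{employee['address']['city']}"
    return doc

emp = {
    "eid": "e1",
    "department": "Sales",
    "address": {"city": "Chennai", "PIN": "600001", "street": ""},
}
```

The query side must derive the same value ("Sales|Chennai") to target a single partition, which is the knowledge-at-query-time trade-off mentioned above.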

MongoDB - How to search an array of dictionaries inside a dictionary

I have a Location collection as below. I need to fetch the details of only those drivers whose ratings are above three, for a particular location.
Thanks in advance...
[
  {
    "name": "Delhi",
    "cab_details (sub table)": [
      {
        "driver_details (join)": {
          "name": "1111",
          "ratings_above_three": true
        },
        "date_joining": date
      },
      {
        "driver_details": {
          "name": "2222",
          "ratings_above_three": false
        },
        "date_joining": date
      }
    ]
  }
]
It would be much easier if you put your drivers into a separate collection: MongoDB does not handle growing objects very well. Follow the principle of least cardinality in your schema design.
Refer to this question: mongodb-performance-with-growing-data-structure
I suspect the model you presented is a City from a cities collection. I would therefore recommend creating a separate collection called cab_drivers and populating it with CabDriver documents. You can maintain the relationship between them by using ObjectIds: store an array of CabDriver._id values in your City as cab_drivers. It would be much easier to query, validate and update them, and the cab_drivers array inside a City document would grow much more slowly than if you kept the whole documents within the parent.
If you use mongoose, you could then call city.populate('cab_drivers') to load all related documents at the application level.
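For completeness, the filter from the original question can be expressed client-side over the embedded shape as below (server-side, an aggregation using $unwind/$match on cab_details would achieve the same). The simplified document drops the date fields:

```python
# Client-side version of the query: keep only drivers whose
# ratings_above_three flag is true for a given city document.

city = {
    "name": "Delhi",
    "cab_details": [
        {"driver_details": {"name": "1111", "ratings_above_three": True}},
        {"driver_details": {"name": "2222", "ratings_above_three": False}},
    ],
}

def top_rated_drivers(city_doc):
    """Return the names of drivers rated above three in this city."""
    return [
        cab["driver_details"]["name"]
        for cab in city_doc["cab_details"]
        if cab["driver_details"]["ratings_above_three"]
    ]
```

With drivers split into their own collection as recommended above, the same filter becomes a plain find({"ratings_above_three": true}) query instead of array traversal.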
