How to arrange my Data in NoSQL (Invoices) - couchdb

I'm taking my first steps with NoSQL databases, and so far my knowledge is very basic. I'm trying to set up a database for a small invoice system.
In SQL I'd create 4 tables: Products, Customers, Invoices, and a match table linking invoices to products.
But how do I do this with NoSQL? Do I even build relations, or do I just build 1 document for each invoice?

Keep in mind that NoSQL design is driven not only by the structure of the data but also, strongly, by how the data is used. So you should first ask yourself what kind of queries you need to run over your data and take it from there.
First figure out how far you want to go with denormalization and aggregation. For instance: which sets of data will you often need to query or update together? Try to keep those in a single document, even if it means duplicating data from other entities (e.g. storing customer data along with the invoice data).
So ask yourself why you want to use a non-relational database and how you will use the data. Then decide which modeling techniques to apply and how far to take them. The Highly Scalable Blog has a great article about NoSQL data modeling if you care to give it a read.
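To make the "keep it in a single document" advice concrete, here is a minimal sketch of what one invoice document could look like, written as a plain Python dict. The id scheme, field names and prices are made up for illustration; nothing here is prescribed by CouchDB.

```python
# Hypothetical invoice document: customer data and line items are embedded,
# so a single read returns everything needed to render the invoice.
invoice = {
    "_id": "invoice:2024-0001",  # made-up id scheme
    "type": "invoice",
    "customer": {  # copy of the customer data at invoicing time
        "customer_id": "customer:42",
        "name": "Jane Doe",
        "address": "1 Example Street",
    },
    "lines": [
        {"product_id": "product:7", "name": "Widget", "unit_price_cents": 250, "quantity": 4},
        {"product_id": "product:9", "name": "Gadget", "unit_price_cents": 1000, "quantity": 1},
    ],
}


def invoice_total_cents(doc):
    """Sum the line items of a single invoice document."""
    return sum(line["unit_price_cents"] * line["quantity"] for line in doc["lines"])


print(invoice_total_cents(invoice))
```

The duplicated customer and product names are the price paid for not needing a join when the invoice is read.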

... or just build 1 document for each invoice.
Yes, do that to begin with. Think of your data in CouchDB as a read-only copy of the data in a relational database: the docs are like the results of your SQL queries.
Do i even build relations?
Of course you can; it's the same as in your SQL tables. You include the ids of foreign docs and name the property after the relation you want to express, e.g. doc.customer_id in an invoice doc can point to the doc._id of a customer doc.
It helps to think of CouchDB views as "relations", e.g. with the example above you can create a view called InvoicesByCustomer.
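As a rough illustration of what such a view does, the sketch below emulates it in memory with Python. Real CouchDB views are map functions (normally JavaScript) stored in a design document; the helper names and sample docs here are assumptions made up for the example.

```python
from collections import defaultdict

# Sample docs (hypothetical); invoices reference customers via customer_id.
docs = [
    {"_id": "customer:1", "type": "customer", "name": "Acme"},
    {"_id": "invoice:1", "type": "invoice", "customer_id": "customer:1"},
    {"_id": "invoice:2", "type": "invoice", "customer_id": "customer:2"},
    {"_id": "invoice:3", "type": "invoice", "customer_id": "customer:1"},
]


def map_invoices_by_customer(doc):
    """Plays the role of the view's map function: yield (key, value) rows."""
    if doc.get("type") == "invoice" and "customer_id" in doc:
        yield doc["customer_id"], doc["_id"]


def build_view(all_docs, map_fn):
    """Run the map function over every doc and group the rows by key."""
    index = defaultdict(list)
    for doc in all_docs:
        for key, value in map_fn(doc):
            index[key].append(value)
    return dict(index)


invoices_by_customer = build_view(docs, map_invoices_by_customer)
print(invoices_by_customer["customer:1"])
```

Querying the view by key then answers "which invoices belong to this customer?" without scanning every document at read time.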
But in summary, I would recommend beginning with the "1 document for each invoice" approach and following #JavoSN's hint ...
So you should first ask yourself what kind of queries you need to do over your data and take it from there
... once you know that clearly, it's time to dig deeper into the possibilities of document design.

Related

MongoDb slow aggregation with many collections (lookup)

I'm working on a MEAN stack project. My aggregation uses many collections, so it needs a lot of lookups, which hurts performance and makes the aggregation very slow. I was wondering if you have any suggestions. I found that lookups can be reduced by storing, for each collection I need, an array of objects in one global collection; however, I'm looking for an optimal and safe solution.
For information, I have defined indexes on all collections in MongoDB.
Thanks for sharing your ideas!
This is a very involved question. Even if you gave all your schemas and queries, it would take too long to answer and would be very specific to your case (i.e. not useful to anyone else coming along later).
Instead, as a general answer, I'd advise you to read up on denormalization and consider some database redesign if this query is core to your project.
Here is a good article to get you started.
Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
A simple example to outline it:
Say you have a blog with a comment collection and a user collection.
You want to display each comment with the name of its user, so you have to load the user for every comment.
Instead, you could save the username in the comment collection as well as in the user collection.
Then you will have a fast query to show comments, as you don't need to load the users too. But if the user changes their name, then you will have to update all of the comments with the new name. This is the main tradeoff.
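A minimal in-memory sketch of that tradeoff, using plain Python lists and dicts in place of real collections (all names and data are hypothetical):

```python
# The username is stored on the user AND copied onto each comment.
users = {"u1": {"name": "alice"}, "u2": {"name": "bob"}}

comments = [
    {"id": "c1", "user_id": "u1", "username": "alice", "text": "First!"},
    {"id": "c2", "user_id": "u2", "username": "bob", "text": "Nice post"},
    {"id": "c3", "user_id": "u1", "username": "alice", "text": "Thanks"},
]


def render_comments(comment_docs):
    """Read path: no user lookup is needed, the name is on the comment."""
    return [f'{c["username"]}: {c["text"]}' for c in comment_docs]


def rename_user(user_id, new_name):
    """Write path: the user doc and every copy of the name must change."""
    users[user_id]["name"] = new_name
    touched = 0
    for c in comments:
        if c["user_id"] == user_id:
            c["username"] = new_name
            touched += 1
    return touched


updated = rename_user("u1", "alice2")
print(updated, render_comments(comments))
```

Reads touch one collection; a rename touches one user plus every comment that user ever wrote, which is why this pays off only for fields that are read far more often than they change.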
If a DB redesign is too difficult, I suggest splitting the query into multiple aggregations and combining the results in memory (i.e. in your Node server-side code).
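Combining in memory usually amounts to a small hash join. The sketch below shows the shape of it in Python for brevity (the same approach applies in Node); the two lists stand in for the results of two separate, simpler queries and are invented for the example.

```python
# Results of two separate queries (hypothetical data).
orders = [
    {"order_id": 1, "customer_id": "c1", "total": 30},
    {"order_id": 2, "customer_id": "c2", "total": 45},
    {"order_id": 3, "customer_id": "c1", "total": 12},
]
customers = [
    {"customer_id": "c1", "name": "Ann"},
    {"customer_id": "c2", "name": "Ben"},
]


def join_in_memory(order_rows, customer_rows):
    """Hash join: index one side by key, then enrich the other side."""
    by_id = {c["customer_id"]: c for c in customer_rows}
    joined = []
    for order in order_rows:
        customer = by_id.get(order["customer_id"])
        name = customer["name"] if customer else None
        joined.append({**order, "customer_name": name})
    return joined


rows = join_in_memory(orders, customers)
print(rows)
```

This only works well when each partial result is small enough to hold in memory on the server.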

How to structure relationships in Azure Cosmos DB?

I have two sets of data in the same collection in Cosmos: 'posts' and 'users'. They are linked by the posts that users create.
Currently my structure is as follows:
// user document
{
  id: 123,
  postIds: ['id1', 'id2']
}
// post documents
{
  id: 'id1',
  ownerId: 123
}
{
  id: 'id2',
  ownerId: 123
}
My main issue with this setup is its fragile nature: code has to enforce the link, and if there's a bug, data will very easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As David said, it's a long discussion, but it is a very common one, so since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
The first thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity), which is needed when you decompose a bigger object into its constituent pieces. This decomposition is also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational databases, since it helps a lot in avoiding data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen a JSON document database, you should leverage the fact that it can store an entire document, and just store each post ALONG WITH all the owner data: name, surname, and any other data you have about the user who created it. Yes, I’m saying that you may want to consider having not posts and users, but just posts with the user info inside them. This may actually be very correct, as you will be sure to get the EXACT user data that existed at the moment the post was created. Say, for example, I create a post while my biography is "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and this is just right, as they have captured reality exactly.
Of course, you may also want to display a biography on an author page. In this case you'll have a problem: which one will you use? Probably the latest one.
If every author MUST have a published blog post in order to exist in your system, that may well be enough. But maybe you want an author to be able to write a biography and be listed in your system even before writing a blog post.
In that case you need to NORMALIZE the model and create a new document type just for authors. If this is your case, you also need to figure out how to handle the situation described above. When an author updates their biography, will you just update the author document, or create a new one? If you create a new one so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let’s assume that you really want to keep posts and users in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (like NoSQL databases in general) DOES NOT OFFER any native support for enforcing referential integrity, so you are pretty much on your own. Indexes can help, of course: you may want to index the ownerId property so that, before deleting an author, for example, you can efficiently check whether any of his/her blog posts would otherwise be left orphaned.
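A minimal sketch of that "check before delete" rule, enforced in application code. The dict stands in for the collection, and posts_owned_by stands in for an indexed query on ownerId; the function and exception names are hypothetical, not part of any Cosmos DB SDK.

```python
class ReferencedError(Exception):
    """Raised when a user still owns posts."""


# In-memory stand-in for one collection holding users and posts.
documents = {
    "user:123": {"id": "user:123", "type": "user"},
    "user:456": {"id": "user:456", "type": "user"},
    "post:id1": {"id": "post:id1", "type": "post", "ownerId": "user:123"},
}


def posts_owned_by(store, owner_id):
    """Stand-in for an indexed query on ownerId."""
    return [
        d["id"]
        for d in store.values()
        if d.get("type") == "post" and d.get("ownerId") == owner_id
    ]


def delete_user(store, user_id):
    """Refuse to delete a user whose posts would be orphaned."""
    owned = posts_owned_by(store, user_id)
    if owned:
        raise ReferencedError(f"{user_id} still owns {len(owned)} post(s)")
    del store[user_id]
```

Note that the check and the delete are two separate steps here, so a post created in between could still end up orphaned; closing that gap is part of the responsibility the application takes on.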
Another option is to manually create and maintain ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to find out which blog posts belong to an author. You can try to keep this document updated automatically using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
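Done in the application rather than with triggers, maintaining such a tracking document could look roughly like this. The postIndex document type and the id scheme are assumptions made up for the sketch.

```python
store = {}  # in-memory stand-in for the collection


def create_post(store, post_id, owner_id, body):
    """Write the post, then update the owner's index document."""
    store[post_id] = {"id": post_id, "type": "post", "ownerId": owner_id, "body": body}
    index_id = f"postIndex:{owner_id}"
    index = store.setdefault(
        index_id, {"id": index_id, "type": "postIndex", "postIds": []}
    )
    if post_id not in index["postIds"]:
        index["postIds"].append(post_id)


def delete_post(store, post_id):
    """Remove the post and its entry in the owner's index document."""
    post = store.pop(post_id)
    index = store[f"postIndex:{post['ownerId']}"]
    index["postIds"].remove(post_id)


create_post(store, "id1", "123", "hello")
create_post(store, "id2", "123", "world")
print(store["postIndex:123"]["postIds"])
```

Every code path that writes or deletes a post has to go through these helpers, otherwise the index drifts out of sync.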
PERFORMANCE
Performance COULD be an issue, but you don't usually model for performance in the first place. You model to make sure your model can represent and store the information you need from the real world, and then you optimize it to get decent performance from the database you have chosen. As different databases have different constraints, the model is then adapted to deal with those constraints. This is nothing more and nothing less than the good old “logical” vs “physical” modeling discussion.
In the case of Cosmos DB, you should avoid cross-partition queries, as they are more expensive.
Unfortunately, the partition key is something you choose once and for all, so you really need to be clear about which common use cases you most want to support. If the majority of your queries are done on a per-author basis, I would partition by author.
Now, while this may seem a clever choice, it is one only if you have A LOT of authors. If you have only one, for example, all data and queries will go to just one partition, limiting your performance A LOT. Remember that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means that all your data will be spread across 5 partitions. Each partition will have an upper limit of 2,000 RU. If all your queries use just one partition, your real maximum performance is 2,000 RU, not 10,000.
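The arithmetic behind that last point, as a tiny sketch. The 10,000 RU and 5 partitions are the figures from the paragraph above, and the even split is an assumption taken from it; the actual partition count and limits are decided by the service.

```python
def per_partition_ru(total_ru, partitions):
    """Even split of provisioned throughput across partitions."""
    return total_ru / partitions


def usable_ru(total_ru, partitions, partitions_hit):
    """Throughput reachable when queries touch only some partitions."""
    return per_partition_ru(total_ru, partitions) * min(partitions_hit, partitions)


print(usable_ru(10_000, 5, 1))  # workload that lands on a single partition
print(usable_ru(10_000, 5, 5))  # workload spread over all partitions
```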
I really hope this helps you start to figure out the answer. And I really hope it helps foster and grow a discussion (how to model for a document database) that I think is really due and mature now.

DocumentDB data structure misunderstanding

I'm starting a new website project and I would like to use DocumentDB as the database instead of a traditional RDBMS.
I will need to store two kinds of documents:
User documents, which will hold all the user data.
Survey documents, which will hold all data about surveys.
May I put both kinds in a single collection, or should I create one collection for each?
How you do this is totally up to you - it's a fairly broad question, and there are good reasons for combining, and good reasons for separating. But objectively, you'll have some specific things to consider:
Each collection has its own cost footprint (starting around $24 per collection).
Each collection has its own performance (RU capacity) and storage limit.
Documents within a collection do not have to be homogeneous - each document can have whatever properties you want. You'll likely want some type of identification property that you can query on, to differentiate document types, should you store them all in a single collection.
Transactions are collection-scoped. So, for example, if you're building server-side stored procedures and need to modify content across your User and Survey documents, you need to keep this in mind.
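If you do go with a single collection, the identification property mentioned above could look like the sketch below. The property name docType and the sample documents are assumptions; the list and the filter function stand in for the collection and for a query that filters on that property.

```python
# One collection holding two kinds of documents, told apart by docType.
collection = [
    {"id": "u1", "docType": "user", "name": "Ann"},
    {"id": "s1", "docType": "survey", "title": "Lunch options", "ownerId": "u1"},
    {"id": "u2", "docType": "user", "name": "Ben"},
]


def of_type(docs, doc_type):
    """Stand-in for a query that filters on the discriminator property."""
    return [d for d in docs if d.get("docType") == doc_type]


print([d["id"] for d in of_type(collection, "user")])
print([d["id"] for d in of_type(collection, "survey")])
```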

PouchDB structure

I am new to the NoSQL concept, so when I started to learn PouchDB, I found this conversion chart. My confusion is: how does PouchDB handle it if, let's say, I have multiple tables? Does it mean that I need to create multiple databases? From my understanding, in PouchDB a database can store a lot of documents, but does a document correspond to a row in SQL, or have I misunderstood?
The answer to this question seems to be surprisingly under-documented. While #llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section When not to use map/reduce, Nolan explains that for simpler applications, the key is to abuse _ids, and leverage the power of allDocs().
In other words, if you had two separate types (say artists, and albums), then you could prefix the id of each type to obtain an easily searchable data set. For example _id: 'artist_name' & _id: 'album_title', would allow you to easily retrieve artists in name order.
Laying out the data this way will result in better performance due to not requiring extra indexes, and less code. Clearly however, if your data requirements are more complex, then views are the way to go.
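The id-prefix trick relies on documents being stored sorted by _id, so one type can be fetched as a key range. The sketch below emulates that range scan in Python; in PouchDB itself this would be an allDocs() call with startkey and endkey, where a high Unicode character such as '\ufff0' is commonly appended to the prefix as the upper bound. Python's plain string ordering is only an approximation of the database's collation, and the helper names and sample data are made up.

```python
from bisect import bisect_left, bisect_right

# Documents keyed by _id; the prefix encodes the type.
docs = {
    "artist_Pixies": {"_id": "artist_Pixies"},
    "album_Doolittle": {"_id": "album_Doolittle"},
    "artist_Bowie": {"_id": "artist_Bowie"},
    "album_Low": {"_id": "album_Low"},
}

HIGH = "\ufff0"  # sorts after any character used in these ids


def all_docs(store, startkey, endkey):
    """Return docs whose _id lies in [startkey, endkey], in _id order."""
    ids = sorted(store)
    lo = bisect_left(ids, startkey)
    hi = bisect_right(ids, endkey)
    return [store[i] for i in ids[lo:hi]]


def by_prefix(store, prefix):
    """All docs of one type, using the id prefix as a range."""
    return all_docs(store, prefix, prefix + HIGH)


print([d["_id"] for d in by_prefix(docs, "artist_")])
```

Because the range is already in id order, artists come back sorted by name without any secondary index.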
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. An SQL table defines column headers (name and type); these correspond to the JSON property names of the doc.
So all docs (rows) with the same properties (a so-called "schema") are the equivalent of your SQL table. You can have as many different schemata in one database as you want (visit json-schema.org for some inspiration).
How do you request them separately? Create CouchDB views! You can get all or some of the "rows" of your tabular data (docs with the same schema) with one request, as you know it from SQL.
To make such views easy to write, it is very common for CouchDB docs to have a type property. The name you know from the SQL table can be your type, like doc.type: "animal".
Your view names might be animalByName or animalByWeight, depending on your needs.
Sometimes a multiple-database plan is a good option, like a database per user or even a database per user-feature. Take a look at this conversation on the CouchDB mailing list.

How to perform intersection operation on two datasets in Key-Value store?

Let's say I have 2 datasets, one for rules, and the other for values.
I need to filter the values based on rules.
I am using a key-value store (Couchbase, Cassandra, etc.). I can use multi-get to retrieve all the values from one table and all the rules from the other, and perform validation in a loop.
However, I find this very inefficient: I move a massive volume of data (values) over the network, and the client is kept busy filtering.
What is the common pattern for finding the intersection between two tables in a key-value store?
The idea behind the NoSQL data model is to write data in a denormalized way so that each table can answer one precise query. As an example, imagine you have reviews written by customers about shops. You need to know the reviews a user has written about shops, and also the reviews a shop has received. This would be modeled using two tables:
ShopReviews
UserReviews
In the first table you query by shop id and in the second by user id; the data is written twice, but each read is a direct access by a single key.
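A minimal sketch of that double write, with two Python dicts standing in for the ShopReviews and UserReviews tables (the function name and sample reviews are invented for the example):

```python
from collections import defaultdict

shop_reviews = defaultdict(list)  # ShopReviews: key is the shop id
user_reviews = defaultdict(list)  # UserReviews: key is the user id


def write_review(shop_id, user_id, text):
    """Write the same review twice, once per access pattern."""
    review = {"shop_id": shop_id, "user_id": user_id, "text": text}
    shop_reviews[shop_id].append(review)
    user_reviews[user_id].append(dict(review))  # independent copy


write_review("shop1", "ann", "Great")
write_review("shop1", "ben", "OK")
write_review("shop2", "ann", "Closed on Sunday")

print(len(shop_reviews["shop1"]))                    # reviews received by shop1
print([r["shop_id"] for r in user_reviews["ann"]])   # shops reviewed by ann
```

Each question is then answered with a single key lookup, at the cost of writing every review to both places.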
In the same way you should organize values by rules (I can't be more precise without knowing the relation between them), and so on. One more consideration: newer versions of NoSQL databases support collections, which might help to model one-to-many relations.
HTH, Carlo

Resources