I would like to ask a question about a possible solution for an e-commerce database design in terms of scalability and flexibility.
We are going to use MongoDB and Node on the backend.
I included an image for you to see what we have so far. We currently have a Products table that can be used to add a product into the system. The interesting part is that we would like to be able to add different types of products to the system with varying attributes.
For example, on the admin management page, we could select a Clothes item, where we should fill out a form with fields such as Height, Length, Size ... etc. The question is: how could we model this kind of structure in the database design?
What we were thinking of was creating tables such as ClothesProduct and many more, and connecting the Products table to the matching one. But we could end up with 100 different tables for the varying product types. We would like to add a product type dynamically from the admin management. Is this possible in Mongoose? Creating all possible fields in the Products table is not efficient, and it would hit us hard in the long term.
Database design snippet
Maybe we should just create separate tables for each unique product type and, from the front-end, select one of them to display the correct form?
Could you please share your thoughts?
Thank you!
We've got a Mongoose backend that I've been working on since its inception about 3 years ago. Here are some of my lessons:
MongoDB is NoSQL: by linking all these objects by ID, it becomes very painful to find all products of "Shop A": you would have to make many queries before getting the list of products for a particular shop (shop -> brand category -> subCategory -> product). Consider nesting certain objects in other objects (e.g. subcategories inside categories, as they are semantically the same). This will save immense loading time.
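For illustration, a minimal sketch of what such nesting could look like in Mongoose (the model and field names here are illustrative, not from our codebase):

var mongoose = require('mongoose');
var Schema = mongoose.Schema;

// Subcategories live inside their parent category, since they are semantically the same.
var SubCategorySchema = new Schema({
    name: String
});

var CategorySchema = new Schema({
    shop: { type: Schema.Types.ObjectId, ref: 'Shop' }, // reference "upwards" only
    name: String,
    subCategories: [SubCategorySchema] // nested: loaded together with the category
});

module.exports = mongoose.model('Category', CategorySchema);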
Dynamically created product fields: we built a (now) big module that allows users to create their own database keys & values, and assign them to different objects. In essence, it looks something like this:
var SpecialFieldModel = new Schema({
    // ...,
    key: String,
    value: String,
    // ...,
})
This way, your users can "make their own products".
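One hedged way such user-defined fields could then be attached to products (the names here are illustrative, not our actual schema):

var ProductSchema = new Schema({
    name: String,
    productType: String, // e.g. "clothes", selected in the admin UI
    specialFields: [{
        key: String,   // e.g. "Size"
        value: String  // e.g. "XL"
    }]
});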
Number of products: MongoDB queries can handle huge data loads, so I wouldn't worry too much about some collections being thousands of objects large. However, if you want large reports on all the data, you will need to make sure your IDs are in the right place. Then you can use the Aggregation framework to construct big queries that might have to tie together multiple collections in the db, and fetch the data in an efficient manner.
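As a rough sketch of such an aggregation (the collection and field names are assumptions for illustration), counting products per shop and tying in the shops collection via $lookup (available since MongoDB 3.2):

db.products.aggregate([
    // Group products by the shop they reference.
    { $group: { _id: "$shop", productCount: { $sum: 1 } } },
    // Tie the shops collection into the result.
    { $lookup: {
        from: "shops",
        localField: "_id",
        foreignField: "_id",
        as: "shop"
    } },
    { $sort: { productCount: -1 } }
]);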
Don't reference IDs in both directions, unless you know what you're doing: saving a reference to the category ID in subcategories and vice-versa is incredibly confusing. Which field do you have to update if you want to switch subcategories? One or the other? Or both? Even with strong tests, it can be very confusing for new developers to understand "which direction the queries are running in" (if you are building a product that might have to be extended in the future). We've done both, which has led to a few problems. However, the modules that saved references to upper objects (rather than lower ones) I have found to be consistently more pleasant and simple to work with.
created/updatedAt: Consider adding these fields to every single model & Schema. This will help with debugging, extensibility, and general features that you will be able to build in the future, which might otherwise be impossible. (ProductSchema.set('timestamps', true);)
Take my advice with a grain of salt, as I haven't designed most of our modules. But these are the sorts of things I consider as I continue working on our applications.
Suppose I have database tables Customer, Order, and Item. I have an OrderRepository that accesses, directly with SQL/my ORM, both the Order and Item tables. E.g. I could have a method getItems on the OrderRepository that returns all items of an order.
Suppose I now also create an ItemRepository. Given that I now have 2 repositories accessing the same database table, is that generally considered poor design? My thinking is, sometimes a user wants to update the details of an Item (e.g. its name), but when using the OrderRepository, it doesn't really make sense not to be able to access the items directly (you want to know about all the items in an order).
Of course, the OrderRepository could internally create an ItemRepository and call methods like getItemsById(ids: string[]). However, consider the case where I want to get all orders and items ever purchased by a Customer. Assuming you had the orderIds for a customer, you could have a getOrders(ids: string[]) on the OrderRepository to fetch all the orders, and then do a second query to fetch all the Items. I feel you make your life harder (and less efficient) in the sense that you have to do the join to match items with orders in the app code, rather than doing a join in SQL.
If it's not considered bad practice, is there some kind of limit to how much overlap repositories should have with each other? I've spent a while trying to search for this on the web, but it seems all the tutorials/blogs/videos really don't go further than 1 table per entity (which may be an anti-pattern).
Or am I missing a trick?
Thanks
FYI: using express with TypeScript (not C#)
Is a repository creating another repository considered acceptable? Shouldn't only the service layer do that?
It's difficult to separate the Database Model from the DDD design but you have to.
In your example:
GetItems should have this signature: OrderRepository.GetItems(Ids: int[]) : ItemEntity[]. Note that this method returns Entities (not DAOs from your ORM). To build each ItemEntity, the method might pull information from several DAOs (tables, through your ORM), but it should only pull what it needs for the entity's hydration.
Say you want to update an item's name using the ItemRepository; your signature for that could look like ItemRepository.rename(Id: int, name: string) : void. When this method does its work, it could change the same table as GetItems above, but note that it could also change other tables as well (for example, it could add an audit of the change to an AuditTable).
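To make that concrete, here is a rough sketch of both methods (plain JavaScript with the MongoDB Node driver as a stand-in for whatever ORM you use; the collection names and the audit record shape are illustrative assumptions):

class ItemRepository {
    constructor(db) {
        this.db = db;
    }

    // Returns hydrated entities, not raw DAOs from the driver/ORM.
    async getItems(ids) {
        var rows = await this.db.collection('items')
            .find({ _id: { $in: ids } }).toArray();
        return rows.map(function (r) {
            return { id: r._id, name: r.name }; // hydrate only what the entity needs
        });
    }

    // May touch more than one table/collection, e.g. writing an audit record.
    async rename(id, name) {
        await this.db.collection('items')
            .updateOne({ _id: id }, { $set: { name: name } });
        await this.db.collection('audit')
            .insertOne({ itemId: id, change: 'rename', newName: name, at: new Date() });
    }
}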
DDD gives you the ability to use different tables for different Contexts if you want. It gives you enough flexibility to make really bold choices when it comes to the infrastructure that surrounds your domain. So ultimately, it's a matter of what makes sense for your specific situation and team. Some teams would apply CQRS, and the GetOrder and Rename methods would look completely different under the covers.
We're building a database of cars and their properties, supposed to be stored in a DynamoDB.
Creating a cars table and filling it with objects that have properties like brand, model, year, etc. is easy.
But we also want a few other features in the admin interface:
Suggestions when typing
When creating a car, it should suggest brand and model from existing cars, when typing in the field.
Should we then maintain a list of brands and models in another table, and make a query to that table, when the user types?
Or is it good enough to query the "rich" table of car definitions, and get all values for brand, all model values where brand has a certain value, etc.? My first thought is that it would be a heavy operation and we'd want a separate index of brands and models. But I'm not a NoSQL expert...
Best matches
When enrolling a new car in our system, we want to use an existing defined car as a reference if possible.
So when the user has typed in a brand, model, year, etc., we want to show a few options of the best matches. We can accept that the year etc. is different, but want the best matches first.
What is the best way to do matches like this on data in a NoSQL database? Any links to tools, concepts etc. will be appreciated :)
Thanks in advance
In DynamoDB (and NoSQL in general), the fewer tables you create, the better your architecture (this is one of the main reasons we use NoSQL). So there is no need for a new table: just add a new attribute and fill it with the searchable data you want. Keep in mind that querying in DynamoDB is case sensitive and that you can only use the begins_with or contains functions to query the data; a sketch of such a query follows the list of cons below.
The cons are:
You will use a lot of read capacity units
You have to manage capitalization yourself
You have to build the searchable attribute each time an item is created
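A hedged sketch of such a prefix query with the AWS SDK for JavaScript v3 (the table, key, and attribute names are assumptions for illustration):

const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, QueryCommand } = require("@aws-sdk/lib-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Suggest cars whose searchable attribute starts with what the user typed.
async function suggest(prefix) {
    const result = await ddb.send(new QueryCommand({
        TableName: "cars", // assumed table: partition key "type", sort key "searchName"
        KeyConditionExpression: "#t = :type AND begins_with(#s, :prefix)",
        ExpressionAttributeNames: { "#t": "type", "#s": "searchName" },
        ExpressionAttributeValues: {
            ":type": "car",
            ":prefix": prefix.toLowerCase() // we manage capitalization ourselves
        },
        Limit: 10
    }));
    return result.Items;
}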
The solution I suggest is using AWS CloudSearch, which gives you an out-of-the-box suggester; you will have better results and a better user experience. Indexing in CloudSearch is automatic each time you add a new item. Be aware of the pricing, however; they will give you 30 days for free.
I just want to get an expert opinion about my use case and the way I am planning to use indices, to see if there is any problem with my approach or if there are better ways to achieve it. Since I am new to ES, your opinions would really help me. We are storing data in CouchDB, in a different database for each type of data.
I have a database that serves as a link between two other databases. For example, database A has 'floor' data, database B links floors to items, and then there is a separate database for each item type that a floor can have (e.g., card reader, camera, etc.).
We need to search for the items that are linked to a floor and fetch them with filtering and paging. (Right now my links database has only ids and types, but I am also planning to save the name for each type in the links db as well, so that I can filter while paging.)
The way I want to achieve filtering and paging in my datastore is to have an index for each db. Based on a floor, I'll get all its linked items for a type and a 'search filter' (from the index of the links db), which gives me a page of certain items; I'll then use the ids from that result to get the full objects (from the index of the db for that item type).
Please let me know if there is a better approach to handling this, e.g., whether I could create one index for my floor, links, and item databases, and whether that is possible through the Logstash CouchDB plugin.
Many thanks.
Your setup does not sound wrong, but there are alternatives. You could use nested objects or parent-child relationships for an easier setup. Both approaches have their advantages. It all depends on the type of queries you would like to run and the number of items that are related.
I would start by reading the following section of the Definitive Guide; that should give you a good start.
https://www.elastic.co/guide/en/elasticsearch/guide/current/modeling-your-data.html?q=model
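For a first impression, here is a sketch of the nested-object variant with the official Node client (v8-style request; the index and field names are assumptions, not your actual schema):

const { Client } = require("@elastic/elasticsearch");
const client = new Client({ node: "http://localhost:9200" });

async function createFloorsIndex() {
    // One index holding floors together with their linked items as nested objects.
    await client.indices.create({
        index: "floors",
        mappings: {
            properties: {
                name: { type: "keyword" },
                items: {
                    type: "nested", // items stay individually queryable
                    properties: {
                        type: { type: "keyword" }, // e.g. "camera", "card_reader"
                        name: { type: "keyword" }
                    }
                }
            }
        }
    });
}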
I'm not yet ready to let this go, which is why I re-thought the problem and edited the Q (original below).
I am using mongoDB for a weekend project and it requires some relations in the DB, which is what the misery is all about:
I have three collections:
Users
Lists
Texts
A user can have texts and lists - lists 'contain' texts. Texts can be in multiple lists.
I decided to go with separate collections (not embeds) because child documents don't always appear in the context of their parent (e.g. all texts, without being in a list).
So what needs to be done is to reference the texts that belong to certain lists from exactly those lists. There can be unlimited lists and texts, though lists will be fewer in comparison.
In contrast to what I first thought of, I could also put the reference in every single text document, instead of all text ids in the list documents. It would actually make a difference, because I could get away with one query to find every snippet in a list. I could even index that reference.
var TextSchema = new Schema({
    _id: Number,
    name: String,
    inListID: { type: Array, "default": [] },
    // [...]
});
It is also rather seldom the case that texts will be in MANY lists, so the array would not really explode. The question kind of remains, though: is there a chance this scales, or is there actually a better way of implementing it with MongoDB? Would it help to limit the number of lists a text can be in (probably)? Is there a recipe for few:many relations?
It would even be awesome to get references to projects where this has been done and how it was implemented (few:many relations). I can't believe everybody shies away from MongoDB as soon as some relations are needed.
Original Question
I'll break it down in two problems I see so far:
1) Let's assume a list consists of 5 texts. How do I reference the texts contained in a list? Just open an array and store the texts' _ids in there? Seems like those arrays might grow to the moon and back, slowing the app down? On the other hand, texts need to be available without a list, so embedding is not really an option. What if I want to get all texts of a list that contains 100 texts? Sounds like two queries and an array with 100 fields :-/. So is this way of referencing the proper way to do it?
var ListSchema = new Schema({
    _id: Number,
    name: String,
    textids: { type: Array, "default": [] },
    // [...]
});
Problem 2) with this approach is cleaning up the references if a text is deleted: its reference will still be in every list that contained the text, and I wouldn't want to iterate through all the lists to clean out those dead references. Or would I? Is there a smart way to solve this? Just making the texts hold the reference (to which lists they are in) only moves the problem around, so that's not an option.
I guess I'm not the first with this sort of problem but I was also unable to find a definitive answer on how to do it 'right'.
I'm also interested in general thoughts on best-practice for this sort of referencing (many-to-many?) and especially scalability/performance.
Relations are usually not a big problem, though certain operations involving relations might be. That depends largely on the problem you're trying to solve, and very strongly on the cardinality of the result set and the selectivity of the keys.
I have written a simple testbed that generates data following a typical long-tail distribution to play with. It turns out that MongoDB is usually better at relations than people believe.
After all, there are only three differences to relational databases:
Foreign key constraints: You have to manage these yourself, so there's some risk for dead links
Transaction isolation: Since there are no multi-document transactions, there's some likelihood for creating invalid foreign key constraints even if the code is correct (in the sense that it never tries to create a dead link), but merely interrupted at runtime. Also, it is hard to check for dead links because you could be observing a race condition
Joins: MongoDB doesn't support joins, though a manual subquery with $in does scale well, up to several thousand items in the $in clause, provided the reference values are indexed, of course (see the sketch below)
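For example, a sketch of such a manual "join" against the list/text schema from the question (one query for the list, one $in query for its texts):

// Assuming the ListSchema above, where textids holds the _ids of the list's texts.
function textsInList(listID, done) {
    List.findById(listID, function (err, list) {
        if (err) return done(err);
        Text.find({ _id: { $in: list.textids } }, function (err, texts) {
            done(err, texts); // fast, provided the referenced values are indexed
        });
    });
}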
If you need to perform large joins, i.e. if your queries are truly relational and you need large amounts of the data joined accordingly, MongoDB is probably not a good fit. However, many joins required in relational databases aren't truly relational; they are required because you had to split up your object into multiple tables, for instance because it contains a list.
An example of a 'truly' relational query could be "Find me all customers who bought products that got >4 star reviews by customers that ranked high in turnover in June". Unless you have a very specialized schema that essentially was built to support this query, you'll most likely need to find all the orders, group them by customer ids, take the top n results, use these to query ratings using $in and use another $in to find the actual customers. Still, if you can limit yourself to the top, say 10k customers of June, this is three round-trips and some fast $in queries.
That will probably be in the range of 10-30ms on typical cloud hardware as long as your queries are supported by indexes in RAM and the network isn't completely congested. In this example, things get messy if the data is too sparse, i.e. the top 10k users hardly wrote >4 star reviews, which would force you to write program logic that is smart enough to keep iterating the first step which is both complicated and slow, but if that is such an important scenario, there is probably a better suited data structure anyway.
Using MongoDB with references is a gateway to performance issues, and a perfect example of what not to do. This is an m:n kind of relation where m and n can scale to millions. MongoDB works well where we have 1:n(few), 1:n(many), or m(few):n(many), but not in situations where you have m(many):n(many). It will obviously result in 2 queries and a lot of housekeeping.
I am not sure whether this question is still relevant, but I have similar experience.
First of all, I want to say what the official Mongo documentation tells us:
Use embedded data models when you have a one-to-one or one-to-many model.
For many-to-many models, use relationships with document references.
I think that is the answer, but it brings a lot of problems, because:
As was mentioned, Mongo doesn't provide transactions at all.
And you don't have foreign key constraints.
Even if you have references (DBRefs) between documents, you will be faced with the awkward problem of how to dereference these documents.
Each of these items is a huge piece of responsibility, even if you are working on a weekend project, and it might mean you have to write a lot of code just to provide simple behaviour in your system (for example, you can see how to implement a transaction in Mongo here).
I have no idea how to implement foreign key constraints, and I haven't seen anything in this direction in the Mongo documentation; that's why I think it is a serious challenge (and a risk for the project).
And lastly, a Mongo reference is not a MySQL join: you don't receive all the data from the parent collection together with the data from the child collection (like all the fields from a table plus all the fields from the joined table in MySQL). You receive just a REFERENCE to another document in another collection, and you have to do something with that reference (dereference it).
This can easily be done in Node with a callback, but only when you need just one text from one list. If you need all the texts in one list, it's terrible, and if you need all the texts in more than one list, it becomes a nightmare...
Perhaps mine is not the best experience... but I think you should think about it...
Using arrays in MongoDB is generally not preferable, and is generally not advised by experts.
Here is a solution that came to my mind:
Each document in Users is always unique. There can be Lists and Texts for an individual document in Users. Therefore, Lists and Texts have a field for USER ID, which will be the _id of the Users document.
Lists always have an owner in Users, so they are stored as they are.
The owner of a Text can be either a User or a List, so you should also keep a field for LIST ID in it, which will be the _id of the Lists document.
Now mind that a Text cannot have both a USER ID and a LIST ID, so you will have to enforce the condition that only ONE of the two is set and the other is null, so that we can easily know who the primary owner of the Text is (a sketch of this follows below).
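A hedged Mongoose sketch of that scheme (the field names are illustrative):

var TextSchema = new Schema({
    name: String,
    userID: { type: Schema.Types.ObjectId, ref: 'User', default: null },
    listID: { type: Schema.Types.ObjectId, ref: 'List', default: null }
});

// Enforce exactly ONE owner: either a user or a list, never both or neither.
TextSchema.pre('save', function (next) {
    if ((this.userID === null) === (this.listID === null)) {
        return next(new Error('Text must have exactly one owner: user OR list'));
    }
    next();
});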
Writing an answer as I want to explain how I will proceed from here.
Taking into consideration the answers here and my own research on the topic, it might actually be fine storing those references (not really relations) in an array, trying to keep it relatively small: fewer than 1000 entries is very likely in my case.
Especially because I can get away with one query (which I first thought I couldn't), one that so far doesn't even require using $in, I'm confident that the approach will scale. After all, it's 'just a weekend project', so if it doesn't and I end up re-writing, that's fine.
With a text-schema like this:
var textSchema = new Schema({
    _id: { type: Number, required: true, index: { unique: true } },
    // ...
    inList: { type: [Number], "default": [], index: true }
});
I can simply get all texts in a list with this query, where inList is an indexed array containing the _ids of the lists a text belongs to.
Text.find({ inList: listID }, function (err, texts) {
    // ...
});
I will still have to deal with foreign key constraints and write my own "clean-up" functions that take care of removing references if a list is removed: remove the reference in every text that was in that list (a sketch follows below).
Luckily this will happen very rarely, so I'm okay with going through every text once in a while.
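What such a clean-up could look like, using the $pull operator (names match my schema above; old-style callbacks to match the rest of this post):

function removeList(listID, callback) {
    List.remove({ _id: listID }, function (err) {
        if (err) return callback(err);
        // One multi-document update removes the dead reference from every text.
        Text.update(
            { inList: listID },
            { $pull: { inList: listID } },
            { multi: true },
            callback
        );
    });
}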
On the other hand, I don't have to care about deleting references in a list document if a text is removed, because I only store the reference on one side of the relation (in the text document). Quite an important point, in my opinion!
#mnemosyn: thanks for the link and for pointing out that this is indeed not a large join, or in other words: just a very simple relation. Also, some numbers on how long those complex operations take (hardware dependent, of course) are a big help.
PS: Greetings from Bielefeld.
What I found most helpful during my own research was this video, where Alvin Richards also talks about many-to-many relations at around minute 17. This is where I got the idea of making the relation one-sided, to save myself some work cleaning up the dead references.
Thanks for the help guys
So I'm writing an application in NodeJS & ExpressJS. It's the first time I'm using a NoSQL database like MongoDB, and I'm trying to figure out how to fix my data model.
At the start of our project, we wrote everything down in relational database terms, but since we recently switched from Laravel to ExpressJS, I'm a bit stuck on what to do with all my different table layouts.
So far I have figured out that it's better to denormalize your schema, but it has to end somewhere, right? In the end you could end up storing your whole data set in one collection. Well, not entirely, but you get the point.
1. So is there a rule or standard that defines where to cut to make multiple collections?
I have a relational database with users (which can be either client or store users), stores, products, purchases, categories, subcategories...
2. Is it bad to define a relationship in a noSQL database?
For example, every product has a category, but I want to relate to the category by an id (a parent field does the job in MongoDB). Is that a bad thing? Or is this where you choose performance vs. structure?
3. Is NoSQL/MongoDB meant to be used for such large databases, which would have many relationships (if they were made in MySQL)?
Thanks in advance
As already written, there are no rules like the second normal form for SQL.
However, there are some best practices and common pitfalls related to optimization for MongoDB which I will list here.
Overuse of embedding
The BSON limit
Contrary to popular belief, there is nothing wrong with references. Assume you have a library of books and you want to track the rentals. You could begin with a model like this:
{
    // We use the ISBN for its uniqueness
    _id: "9783453031456",
    title: "Schismatrix",
    author: "Bruce Sterling",
    rentals: [
        {
            name: "Markus Mahlberg",
            start: "2015-05-05T03:22:00Z",
            due: "2015-05-12T12:00:00Z"
        }
    ]
}
While there are several problems with this model, the most important isn't obvious: there will be a limited number of rentals, because BSON documents have a size limit of 16MB.
The document migration problem
The other problem with storing rentals in an array is that it would cause relatively frequent document migrations, which is a rather costly operation. BSON documents are never partitioned; they are created with some additional space allocated in advance, which is used when they grow. This additional space is called padding. When the padding is exceeded, the document is moved to another location in the data files and new padding space is allocated. So frequent additions of data cause frequent document migrations.
Hence, it is best practice to prevent frequent updates that increase the size of the document, and to use references instead.
So for the example, we would change our single model and create a second one. First, the model for the book
{
_id: "9783453031456",
title:"Schismatrix",
author: "Bruce Sterling"
}
The second model for the rental would look like this
{
_id: new ObjectId(),
book: "9783453031456",
rentee: "Markus Mahlberg",
start: ISODate("2015-05-05T03:22:00Z"),
due: ISODate("2015-05-05T12:00:00Z"),
returned: ISODate("2015-05-05T11:59:59.999Z")
}
The same approach of course could be used for author or rentee.
The problem with over-normalization
Let's look back some time. A developer would identify the entities involved in a business case, define their properties and relations, write the according entity classes, bang his head against the wall for a few hours to get the triple inner-outer-above-and-beyond JOIN required for the use case working, and all lived happily ever after. So why use NoSQL in general and MongoDB in particular? Because nobody lived happily ever after. This approach scales horribly, and practically the only way to scale is vertically.
But the main difference of NoSQL is that you model your data according to the questions you need to get answered.
That being said, let's look at a typical n:m relation and take the relation from authors to books as our example. In SQL, you'd have 3 tables: two for your entities (books and authors) and one for the relation (Who is the author of which book?). Of course, you could take those tables and create their equivalent collections. But, since there are no JOINs in MongoDB, you'd need three queries (one for the first entity, one for its relations and one for the related entities) to find the related documents of an entity. This wouldn't make sense, since the three table approach for n:m relations was specifically invented to overcome the strict schemas SQL databases enforce.
Since MongoDB has a flexible schema, the first question is where to store the relation, keeping the problems arising from overuse of embedding in mind. Since an author might write quite a few books in the years to come, but the authorship of a book rarely, if ever, changes, the answer is simple: we store references to the authors in the book's data
{
_id: "9783453526723",
title: "The Difference Engine",
authors: ["idOfBruceSterling","idOfWilliamGibson"]
}
And now we can find the authors of that book by doing two queries:
var book = db.books.findOne({title:"The Difference Engine"})
var authors = db.authors.find({_id: {$in: book.authors}})
I hope the above helps you to decide when to actually "split" your collections and to get around the most common pitfalls.
Conclusion
As to your questions, here are my answers
As written before: no, but keeping the technical limitations in mind should give you an idea of when it could make sense.
It is not bad, as long as it fits your use case(s). If you have a given category and its _id, it is easy to find the related products. When loading a product, you can easily get the categories it belongs to, even efficiently so, as _id is indexed by default.
I have yet to find a use case that can't be done with MongoDB, though some things can get a bit more complicated with it. What you should do, imho, is take the sum of your functional and non-functional requirements and check whether the advantages outweigh the disadvantages. My rule of thumb: if either "scalability" or "high availability/automatic failover" is on your list of requirements, MongoDB is worth more than a look.
The very "first" thing to consider when choosing an "NoSQL" solution for storage over an "Relational" solution is that things "do not work in the same way" and therefore respond differently by design.
More specifically, solutions such as MongoDB are "not meant" to "emulate" the "relational join" structure that is present in many SQL and therefore "relational" backends; they are instead intended to look at data "joins" in a very different way.
This bears on your "questions" as follows:
There really is no set "rule"; understand that the "rules" of denormalization do not apply here, for the basic reason that NoSQL solutions exist to offer something "different" that may work well for your situation.
Is it bad? Is it Good? Both are subjective. Considering point "1" here, there is the basic consideration that "non-relational" or "NoSQL" databases are designed to do things "differently" than a relational system is. So therefore there is usually a "penalty" to "emulating joins" in a relational manner. Specifically for MongoDB this means "additional requests". But that does not mean you "cannot" or "should not" do that. Rather it is all about how your usage pattern works for your application.
Re-capping on the basic points made above, NoSQL in general is designed to solve problems that do not suit the traditional SQL and/or "relational" design pattern, and therefore replace them with something else. The "ultimate goal" here is for you to "rethink your data access patterns" and evolve your application to use a storage model that is more suited to how you access it in your application usage.
In short, there are no strict rules, and that is also part of the point in moving away from "nth-normal-form" rules. NoSQL solutions such as MongoDB allow for "nested structure" storage that typical SQL/Relational solutions do not provide in an efficient form.
Another side of this is considering that operations such as "joins" do not "scale" well over "big data" forms, therefore there exists the different way to "join" by offering concepts such as "embedded data structures", such as MongoDB does.
You would do well to read some guides on how the various NoSQL solutions approach storing and accessing data. This is ultimately what you need to decide on, to determine which is best for you and your application.
At the end of the day, it should be about realising when a SQL/Relational model does not meet your needs, and then choosing something else.