rdb vs key-value store for django functionality [duplicate] - python-3.x

When would one choose a key-value data store over a relational DB? What considerations go into deciding on one or the other? When is a mix of both the best route? Please provide examples if you can.

Key-value, hierarchical, map-reduce, or graph database systems are much closer to implementation strategies; they are heavily tied to the physical representation. The primary reason to choose one of these is if there is a compelling performance argument and it fits your data processing strategy very closely. Beware: ad-hoc queries are usually not practical for these systems, and you're better off deciding on your queries ahead of time.
Relational database systems try to separate the logical, business-oriented model from the underlying physical representation and processing strategies. This separation is imperfect, but still quite good. Relational systems are great for handling facts and extracting reliable information from collections of facts. Relational systems are also great at ad-hoc queries, which the other systems are notoriously bad at. That's a great fit in the business world and many other places. That's why relational systems are so prevalent.
If it's a business application, a relational system is almost always the answer. For other systems, it's probably the answer. If you have more of a data processing problem, like some pipeline of things that need to happen and you have massive amounts of data, and you know all of your queries up front, another system may be right for you.

If your data is simply a list of things and you can derive a unique identifier for each item, then a KVS is a good match. They are close implementations of the simple data structures we learned in freshman computer science and do not allow for complex relationships.
A simple test: can you represent your data and all of its relationships as a linked list or hash table? If yes, a KVS may work. If no, you need an RDB.
You still need to find a KVS that will work in your environment. Support for KVSes, even the major ones, is nowhere near what it is for, say, PostgreSQL and MySQL/MariaDB.

IMO, a key-value store (e.g. a NoSQL database) works best when the underlying data is unstructured, unpredictable, or changing often. If you don't have structured data, a relational database is going to be more trouble than it's worth, because you will need to make lots of schema changes and/or jump through hoops to conform your data to the structure.
KVP / JSON / NoSql is great because changes to the data structure do not require completely refactoring the data model. Adding a field to your data object is simply a matter of adding it to the data. The other side of the coin is there are fewer constraints and validation checks in a KVP / Nosql database than a relational database so your data might get messy.
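A minimal sketch of the trade-off described above, using Python's built-in sqlite3 and json modules as stand-ins (a plain dict plays the key-value store here; the table, keys and field names are made up for illustration). The relational side needs a schema change before the new field can be stored, while the document side simply carries the extra key and validates nothing:

```python
import json
import sqlite3

# Relational side: the new field needs a schema change first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("ALTER TABLE users ADD COLUMN nickname TEXT")  # schema migration
conn.execute("UPDATE users SET nickname = 'al' WHERE id = 1")
row = conn.execute("SELECT id, name, nickname FROM users WHERE id = 1").fetchone()

# Key-value side: values are opaque JSON blobs stored under a key.
kv_store = {}
kv_store["user:1"] = json.dumps({"name": "alice"})
doc = json.loads(kv_store["user:1"])
doc["nickname"] = "al"  # no migration needed, but also no validation
kv_store["user:1"] = json.dumps(doc)

# Nothing stops another writer from storing a differently shaped value,
# which is how the data "might get messy".
kv_store["user:2"] = json.dumps({"fullname": "Bob", "nick_name": 42})
```

The NOT NULL constraint on the relational side would reject a row without a name; the dict accepts anything, so that check would have to live in application code.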
There are performance and space saving benefits for relational data models. Normalized relational data can make understanding and validating the data easier because there are table key relationships and constraints to help you out.
One of the worst patterns I've seen is trying to have it both ways. Trying to put a key-value pair into a relational database is often a recipe for disaster. I would recommend using the technology that suits your data foremost.

If you want O(1) lookups of values based on keys, then you want a KV store. Meaning, if you have data of the form k1={foo}, k2={bar}, etc, even when the values are larger/ nested structures, and want fast lookups, you want a KV store.
Even with proper indexing, you generally cannot achieve O(1) lookups in a relational DB for arbitrary keys, because the usual B-tree index gives O(log n) lookups. Sometimes this is referred to as "random lookups".
Alternatively stated, if you only ever query by one column, a "primary key" if you will, to retrieve the rest of the data, then using that column as a keyspace and the rest of the data as a value in a KV store is the most efficient way to do lookups.
In contrast, if you often query the data by any of several columns, aka you support a richer query API for the data, then you may want a relational database.
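To make the contrast concrete, here is a small sketch with a Python dict standing in for a KV store and an in-memory SQLite table standing in for a relational database. The data, the function names and the index name are all illustrative assumptions, not part of any particular product:

```python
import sqlite3

people = [
    (1, "alice", "berlin", 34),
    (2, "bob", "paris", 27),
    (3, "carol", "berlin", 41),
]

# Key-value style: one key, one opaque value.
by_id = {pid: {"name": name, "city": city, "age": age}
         for pid, name, city, age in people}

def kv_get(pid):
    """Average-case O(1) hash lookup by the only key we have."""
    return by_id.get(pid)

def kv_find_by_city(city):
    """Querying by anything other than the key means scanning every value."""
    return sorted(p["name"] for p in by_id.values() if p["city"] == city)

# Relational style: any column can be queried, and indexed if needed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT, age INTEGER)"
)
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", people)
conn.execute("CREATE INDEX idx_people_city ON people (city)")

def sql_find_by_city(city):
    """Secondary-column query served by the database, not by a full scan in app code."""
    rows = conn.execute(
        "SELECT name FROM people WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in rows]
```

If `kv_find_by_city` is something the application needs often, that is a hint the data wants the richer query API.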

A traditional relational database has problems scaling beyond a point. Where that point is depends a bit on what you are trying to do.
All (most?) of the suppliers of cloud computing are providing key-value data stores.
However, if you have a reasonably sized application with a complicated data structure, then the support that you get from using a relational database can reduce your development costs.

In my experience, if you're even asking the question whether to use traditional vs esoteric practices, then go traditional. While esoteric practices are sexy, challenging, and fun, 99.999% of applications call for a traditional approach.
With regards to relational vs KV, the question you should be asking is:
Why would I not want to use a relational model for this scenario: ...
Since you have not described the scenario, it's impossible for anyone to tell you why you shouldn't use it. The "catch all" reason for KV is scalability, which isn't a problem now. Do you know the rules of optimization?
Don't do it.
(for experts only) Don't do it now.
KV is a highly optimized solution to scalability that will most likely be completely unnecessary for your application.

Related

Understanding Cassandra - can it replace RDBMS? [closed]

I've spent the last week cramming on Cassandra, trying to understand the basics, as well as if it fits our needs, or not. I think I understand it on a basic level at this point, but if it works like I believe I'm being told...I just can't tell if it's a good fit.
We have a microservices platform which is essentially a large data bus between our customers. They use a set of APIs to push and pull shared data. The filtering, thus far, is pretty simple...but there's no way to know what the future may bring.
On top of this platform is an analytics layer with several visualizations (bar charts, graphs, etc.) based on the data being passed around.
The microservices platform was built atop MySQL with the idea that we could use clustering, which we honestly did not have a lot of luck with. On top of that, changes are painful, as is par for the course in the RDBMS world. Also, we expect extraordinary amounts of data with thousands-upon-thousands of concurrent users - it seems that we'll have an inevitable scaling problem.
So, we began looking at Cassandra as a distributed nosql potential replacement.
I watched the DataStax videos, took a course on another site, and started digging in. What I'm finding is:
Data is stored redundantly across several tables, each of which uses different primary and clustering keys, to enable different types of queries, since rows are scattered across different nodes in the cluster
Rather than joining, which isn't supported, you'd denormalize and create "wide" tables with tons of columns
Data is eventually consistent, so new writes may not be readily readable in a predictable, reasonable amount of time.
CQL, while SQL-like, is mostly a lie. How you store and key data determines which types of queries you can use. It seems very limited and inflexible.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs. If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
I want to like this idea and love the distributed features, but frankly am mostly scared off, at this point. I feel like I've learned a lot and nothing at all, in the last week, and am entirely unsure how to proceed.
I looked into JanusGraph, Elassandra, etc. to see if that would provide a simpler interface on top of Cassandra, relegating it to basically a storage engine, but am not confident many of these things are mature enough or even proper, for what we need.
I suppose I'm looking for direction and insight from those of you who have built things w/ Cassandra, to see if it's a fit for what we're doing. I'm out of R&D time, unfortunately. Thanks!
Understanding Cassandra - can it replace RDBMS?
The short answer here is "NO." Cassandra is not a simple drop-in replacement for an RDBMS when you suddenly need it to scale.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs.
It fits long-term database needs if you're applying it to the right use case.
DISCLAIMER: I am a bit of a Cassandra zealot. I've used it for a while, made minor contributions to the project, been named a "Cassandra MVP," and even co-authored a book about it. I think it's a great piece of tech, and you can do amazing things with it.
That being said, there are a lot of things that it's just not good at:
Query flexibility. The tradeoff you make for spreading rows across multiple nodes to meet operational scale, is that you have to know your query patterns ahead of time, and then follow them strictly. The idea, is that you want to have all queries served by a single node. And you'll have to put some thought into your data model to achieve that. Unbound queries (SELECTs without WHERE clauses) become the enemy.
Updating data in-place. Plan on storing values by a key, but then updating them a lot (ex: status)? Cassandra is not a good fit for that. This is because Cassandra has a log-based storage engine which doesn't overwrite anything...it just obsoletes it. So your previous values are still there, and still take up space and compute resources.
Deleting Data. Deleting data in the distributed database world is tricky. After all, how do you replicate nothing to another node? Cassandra's answer to that problem, is to use a structure called a tombstone. Tombstones take up space, can slow performance, and need to stay around long enough to replicate (making their removal tricky).
Maintaining Data Consistency. Being highly-available and partition tolerant, Cassandra embraces the concept of "eventual consistency." So it should come as no surprise that it really wasn't designed to be consistent. It has a lot of mechanisms which will help keep data consistent, but they are far from perfect. Plus, there really isn't a way to know for sure if your data is in sync or not.
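The points about in-place updates and deletes can be illustrated with a toy append-only store. This is a simplified analogy in plain Python, not Cassandra's actual storage engine; the class and method names are invented, and the toy compaction drops tombstones immediately, which a distributed system cannot do because they first have to reach every replica:

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"

class AppendOnlyStore:
    """Toy log-structured store: writes are appended, never overwritten."""

    def __init__(self):
        self.log = []  # (key, value) entries in write order

    def put(self, key, value):
        self.log.append((key, value))

    def delete(self, key):
        # A delete is just another write: a tombstone shadowing older values.
        self.log.append((key, TOMBSTONE))

    def get(self, key):
        # The newest entry for a key wins.
        for k, v in reversed(self.log):
            if k == key:
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        """Keep only the newest entry per key and drop tombstones."""
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = [(k, v) for k, v in latest.items() if v is not TOMBSTONE]


log_store = AppendOnlyStore()
log_store.put("order:1", "pending")
log_store.put("order:1", "shipped")    # obsoletes "pending", does not overwrite it
log_store.put("order:1", "delivered")
log_store.put("order:2", "pending")
log_store.delete("order:1")            # adds a tombstone, removes nothing yet
entries_before = len(log_store.log)
log_store.compact()
entries_after = len(log_store.log)
```

A frequently updated status field keeps growing the log until compaction runs, which is the space and compute cost mentioned above.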
If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
Materialized views are something that I'd continue to stay away from for the foreseeable future. They're "experimental" for a reason. Basically, once they're out of sync, the only way to get them back in sync is to rebuild them.
I coach my dev teams on keeping their query tables (tables containing the same data, just keyed differently) in sync with BATCH statements. In fact, BATCH is a misnomer as it probably should have been named "ATOMIC" instead. Because of its name, it is heavily misused, and its misuse can lead to problems. But, it does keep mutations applied atomically, so that does help.
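As a rough analogy for query tables kept in sync by an all-or-nothing write, the sketch below uses two Python dicts holding the same rows under different keys. It does not use a Cassandra driver or CQL, and the table names and the `apply_batch` helper are hypothetical; it only shows the shape of the idea:

```python
import copy

class QueryTables:
    """Two in-memory 'query tables' with the same rows, keyed differently."""

    def __init__(self):
        self.tables = {"users_by_id": {}, "users_by_email": {}}

    def apply_batch(self, mutations):
        """Apply all (table, key, row) mutations, or none of them."""
        snapshot = copy.deepcopy(self.tables)
        try:
            for table, key, row in mutations:
                self.tables[table][key] = dict(row)  # KeyError on unknown table
        except Exception:
            self.tables = snapshot  # roll back the partial write
            raise

    def save_user(self, user_id, email, name):
        row = {"id": user_id, "email": email, "name": name}
        # One logical write fans out to every table that serves a query.
        self.apply_batch([
            ("users_by_id", user_id, row),
            ("users_by_email", email, row),
        ])


db = QueryTables()
db.save_user(1, "alice@example.com", "Alice")
```

The point of grouping the writes is that a reader never finds a user in one table who is permanently missing from the other; it is not a way to speed up bulk loading.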
Basically, scrutinize your database requirements. If Cassandra doesn't cut it, then try to find one which does. CockroachDB (or one of the other NewSQLs) might be a better fit for what you're talking about. It tries to be a drop-in for Postgres, and it scales with some Cassandra-like mechanisms, so it might be worth looking into.
Cassandra is very good at what it does but it is not a drop-in replacement for an RDBMS. If you find that you need any of the following, I would not encourage you to migrate to Cassandra:
Strict consistency
ACID transactions
Support for ad-hoc queries, including joins, aggregates, etc.
Now as for you hitting some limits (or thinking you will hit them in the future) with MySQL, here are some thoughts:
Don't think that a limitation in MySQL is a limitation in RDBMS in general. Just so you don't think I am a $some_other_DB zealot, I've been using MySQL for almost 20 years, but it is not the best tool for all jobs.
If by 'changes' you mean 'schema changes', a lot of the pain can be alleviated by either:
Using an RDBMS where they are implemented better (including perhaps a more recent MySQL version)
Using community supported tools such as pt-online-schema-change or gh-ost
Good luck!

MongoDB (noSQL) when to split collections

So I'm writing an application in NodeJS & ExpressJS. It's my first time using a NoSQL database like MongoDB and I'm trying to figure out how to fix my data model.
At the start of our project we wrote everything down in relational database terms, but since we recently switched from Laravel to ExpressJS I'm a bit stuck on what to do with all my different table layouts.
So far I have figured out it's better to denormalize your schema, but it does have to end somewhere, right? In the end you can end up storing your whole data in one collection. Well, not entirely, but you get the point.
1. So is there a rule or standard that defines where to cut to make multiple collections?
I have a relational database with users (which are either clients or store users), stores, products, purchases, categories, subcategories ..
2. Is it bad to define a relationship in a noSQL database?
Like every product has a category but I want to relate to the category by an id (parent does the job in MongoDB) but is it a bad thing? Or is this where you choose performance vs structure?
3. Is NoSQL/MongoDB meant to be used for such large databases, which would have many relationships (if they were made in MySQL)?
Thanks in advance
As already written, there are no rules like the second normal form for SQL.
However, there are some best practices and common pitfalls related to optimization for MongoDB which I will list here.
Overuse of embedding
The BSON limit
Contrary to popular belief, there is nothing wrong with references. Assume you have a library of books, and you want to track the rentals. You could begin with a model like this
{
  // We use ISBN for its uniqueness
  _id: "9783453031456",
  title: "Schismatrix",
  author: "Bruce Sterling",
  rentals: [
    {
      name: "Markus Mahlberg",
      start: "2015-05-05T03:22:00Z",
      due: "2015-05-12T12:00:00Z"
    }
  ]
}
While there are several problems with this model, the most important isn't obvious – there will be a limited number of rentals because of the fact that BSON documents have a size limit of 16MB.
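A back-of-the-envelope estimate of that ceiling, as a sketch only: it uses the length of the compact JSON encoding as a stand-in for the BSON size, which is an assumption (BSON encodes types and lengths differently, so the real number will differ). The helper names are made up for illustration:

```python
import json

BSON_LIMIT_BYTES = 16 * 1024 * 1024  # the 16MB document size limit

book = {"_id": "9783453031456", "title": "Schismatrix", "author": "Bruce Sterling"}
rental = {
    "name": "Markus Mahlberg",
    "start": "2015-05-05T03:22:00Z",
    "due": "2015-05-12T12:00:00Z",
}

def encoded_size(obj):
    """Rough size proxy: byte length of the compact JSON encoding."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))

def max_embedded_rentals(book_doc, rental_doc, limit=BSON_LIMIT_BYTES):
    """Estimate how many rentals fit before the document reaches the limit."""
    per_rental = encoded_size(rental_doc) + 1  # +1 for the separating comma
    return (limit - encoded_size(book_doc)) // per_rental

estimate = max_embedded_rentals(book, rental)
```

With entries this small the estimate comes out well above a hundred thousand rentals, so the limit is far away for one book, but it is a hard stop rather than a gradual slowdown, and larger embedded entries reach it much sooner.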
The document migration problem
The other problem with storing rentals in an array would be that this would cause relatively frequent document migrations, which is a rather costly operation. BSON documents are never partitioned and created with some additional space allocated in advance used when they grow. This additional space is called padding. When the padding is exceeded, the document is moved to another location in the datafiles and new padding space is allocated. So frequent additions of data cause frequent document migrations.
Hence, it is best practice to prevent frequent updates increasing the size of the document and use references instead.
So for the example, we would change our single model and create a second one. First, the model for the book
{
  _id: "9783453031456",
  title: "Schismatrix",
  author: "Bruce Sterling"
}
The second model for the rental would look like this
{
  _id: new ObjectId(),
  book: "9783453031456",
  rentee: "Markus Mahlberg",
  start: ISODate("2015-05-05T03:22:00Z"),
  due: ISODate("2015-05-05T12:00:00Z"),
  returned: ISODate("2015-05-05T11:59:59.999Z")
}
The same approach of course could be used for author or rentee.
The problem with over normalization
Let's look back some time. A developer would identify the entities involved in a business case, define their properties and relations, write the corresponding entity classes, bang his head against the wall for a few hours to get the triple inner-outer-above-and-beyond JOIN required for the use case working, and all lived happily ever after. So why use NoSQL in general and MongoDB in particular? Because nobody lived happily ever after. This approach scales horribly, and almost always the only way to scale it is vertically.
But the main difference of NoSQL is that you model your data according to the questions you need to get answered.
That being said, let's look at a typical n:m relation and take the relation from authors to books as our example. In SQL, you'd have 3 tables: two for your entities (books and authors) and one for the relation (Who is the author of which book?). Of course, you could take those tables and create their equivalent collections. But, since there are no JOINs in MongoDB, you'd need three queries (one for the first entity, one for its relations and one for the related entities) to find the related documents of an entity. This wouldn't make sense, since the three table approach for n:m relations was specifically invented to overcome the strict schemas SQL databases enforce.
Since MongoDB has a flexible schema, the first question would be where to store the relation, keeping the problems arising from overuse of embedding in mind. Since an author might write quite a few books in the years coming, but the authorship of a book rarely, if at all, changes, the answer is simple: We store the authors as a reference to the authors in the books data
{
  _id: "9783453526723",
  title: "The Difference Engine",
  authors: ["idOfBruceSterling", "idOfWilliamGibson"]
}
And now we can find the authors of that book by doing two queries:
var book = db.books.findOne({title:"The Difference Engine"})
var authors = db.authors.find({_id: {$in: book.authors}})
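The same two-step lookup can be mimicked in plain Python with lists of dicts standing in for the collections. This is only a sketch of the access pattern, not a MongoDB driver call; `find_one` and `find_in` are hypothetical helpers, and the author names are filled in from the placeholder ids for illustration:

```python
books = [
    {"_id": "9783453526723", "title": "The Difference Engine",
     "authors": ["idOfBruceSterling", "idOfWilliamGibson"]},
    {"_id": "9783453031456", "title": "Schismatrix",
     "authors": ["idOfBruceSterling"]},
]
authors = [
    {"_id": "idOfBruceSterling", "name": "Bruce Sterling"},
    {"_id": "idOfWilliamGibson", "name": "William Gibson"},
]

def find_one(collection, **criteria):
    """Return the first document whose fields equal all criteria, else None."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in criteria.items()):
            return doc
    return None

def find_in(collection, field, values):
    """Return documents whose field is one of values (like an $in filter)."""
    wanted = set(values)
    return [doc for doc in collection if doc.get(field) in wanted]

def authors_of(title):
    book = find_one(books, title=title)                # query 1: the book
    if book is None:
        return []
    found = find_in(authors, "_id", book["authors"])   # query 2: its authors
    return [a["name"] for a in found]

def books_by(author_id):
    """Reverse direction: a single query over the reference array."""
    return [b["title"] for b in books if author_id in b["authors"]]
```

Storing the references on the book keeps both directions cheap: two queries from book to authors, one from author to books.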
I hope the above helps you to decide when to actually "split" your collections and to get around the most common pitfalls.
Conclusion
As to your questions, here are my answers
1. As written before: No, but keeping the technical limitations in mind should give you an idea when it could make sense.
2. It is not bad – as long as it fits your use case(s). If you have a given category and its _id, it is easy to find the related products. When loading the product, you can easily get the categories it belongs to, even efficiently so, as _id is indexed by default.
3. I have yet to find a use case which can't be done with MongoDB, though some things can get a bit more complicated with MongoDB. What you should do imho is to take the sum of your functional and non-functional requirements and check whether the advantages outweigh the disadvantages. My rule of thumb: if one of "scalability" or "high availability/automatic failover" is on your list of requirements, MongoDB is worth more than a look.
The very "first" thing to consider when choosing an "NoSQL" solution for storage over an "Relational" solution is that things "do not work in the same way" and therefore respond differently by design.
More specifically, solutions such as MongoDB are "not meant" to "emulate" the "relational join" structure that is present in many SQL and therefore "relational" backends, and that they are moreover intended to look at data "joins" in a very different way.
This arrives at your "questions" as follows:
1. There really is no set "rule", and understand that the "rules" of denormalization do not apply here for the basic reason of why NoSQL solutions exist. And that is to offer something "different" that may work well for your situation.
2. Is it bad? Is it Good? Both are subjective. Considering point "1" here, there is the basic consideration that "non-relational" or "NoSQL" databases are designed to do things "differently" than a relational system is. So therefore there is usually a "penalty" to "emulating joins" in a relational manner. Specifically for MongoDB this means "additional requests". But that does not mean you "cannot" or "should not" do that. Rather it is all about how your usage pattern works for your application.
3. Re-capping on the basic points made above, NoSQL in general is designed to solve problems that do not suit the traditional SQL and/or "relational" design pattern, and therefore replace them with something else. The "ultimate goal" here is for you to "rethink your data access patterns" and evolve your application to use a storage model that is more suited to how you access it in your application usage.
In short, there are no strict rules, and that is also part of the point in moving away from "nth-normal-form" rules. NoSQL solutions such as MongoDB allow for "nested structure" storage that typical SQL/Relational solutions do not provide in an efficient form.
Another side of this is considering that operations such as "joins" do not "scale" well over "big data" forms, therefore there exists the different way to "join" by offering concepts such as "embedded data structures", such as MongoDB does.
You would do well to read some guides on how the various NoSQL solutions approach storing and accessing data. This is ultimately what you need to decide on to determine which is best for you and your application.
At the end of the day, it should be about realising when a SQL/Relational model does not meet your needs, and then choosing something else.

Cassandra or mongodb or something else for big online sales site

Currently we are using mongodb as the primary store for a big online sales site, and we are focusing on scaling out across multiple machines.
Site backend is written in node.js and we are using mongoose as ODM.
I see many blog posts about the awesome Cassandra DB, and I am starting to think about switching to Cassandra. But I am still not sure whether this is really a good decision, because I didn't find any good ODM/ORM library for Cassandra and node.js (and writing raw queries can be a pain; writing a good, tested ORM/ODM can also be a time-consuming task). So I am not sure how much benefit I would get from this switch. We are using elasticsearch as our search engine, and it works excellently in combination with mongodb; I am asking myself whether it will work as well with Cassandra.
If you have any experience with this, it would be very helpful.
Thank you!
Cassandra is a very nicely designed database, which can cover a lot of scenarios. MongoDB is also a really good DB engine. So let me just compare a couple of main points for you.
Always on system
Cassandra is really great when you need to provide 24x7 operations in multiple data centers. If you have more than one data center with multiple servers in each of them, then Cassandra is great for you. Cassandra can sync writes to more than one data center and maintain the desired data consistency across complex setups. Recovery and re-sync are also quite easy.
On the other hand, MongoDB is easy to operate. If you have one data center and only a couple of servers, it might be a perfect fit (although the global write lock might be a pain over time). In simple deployments it's easy to maintain and monitor.
Scalability
To continue the above statements - Cassandra is linearly scalable. There is, literally, no limit of how big the cluster will be. Your writes will always stay fast, while reads might become more complicated over time - depending on the structure of your data.
Denormalization of data
With Cassandra your writes and reads can be extremely fast if you will create a structure that will reflect what you need to get from your data. There is no query language (well, there is, but it's not exactly SQL) that you can use to reorganize your result set using aggregates, groupings, etc. Yes, some things are doable and some not - that is very specific to Cassandra data model. You will have to implement a lot of things on your own and write the result to the DB - i.e. counters for aggregation, different groupings, etc.
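"Implementing things on your own and writing the result to the DB" usually means maintaining aggregates at write time instead of computing them at query time. A minimal in-memory sketch of that pattern follows; it is plain Python, not Cassandra code, and the class, field names and sample orders are invented for illustration:

```python
from collections import Counter

class SalesStore:
    """Toy store that maintains aggregates on write instead of at query time."""

    def __init__(self):
        self.orders = {}                        # order_id -> order row
        self.orders_per_day = Counter()         # day -> number of orders
        self.revenue_per_category = Counter()   # category -> summed cents

    def record_order(self, order_id, day, category, amount_cents):
        if order_id in self.orders:
            raise ValueError(f"duplicate order id: {order_id}")
        self.orders[order_id] = {
            "day": day, "category": category, "amount_cents": amount_cents,
        }
        # Update every precomputed view the application will later read.
        self.orders_per_day[day] += 1
        self.revenue_per_category[category] += amount_cents


sales = SalesStore()
sales.record_order("o1", "2015-05-05", "books", 1299)
sales.record_order("o2", "2015-05-05", "games", 4999)
sales.record_order("o3", "2015-05-06", "books", 899)
```

Reads become single key lookups, but every new kind of report means another structure to maintain on the write path, and possibly a backfill over existing data.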
In comparison, MongoDB is easy to use, easier to learn and more flexible - both for development (as far as the learning curve and effort go) and for implementing business logic (as far as time and effort are concerned). That is - kind of - a reason why there are ORM engines for MongoDB and only a couple (very limited) for Cassandra.
To summarize - both DBs are really good... if you embrace their limitations. If you have only 100GB of data and you need a flexible, easy-to-implement DB engine, I would stick with MongoDB; alternatively, take a look at RethinkDB, which has a very similar model and a way better (in my personal opinion) clustering/data center replication implementation.
Cassandra is a great option for you if you will need to store TBs of data soon, deploying your apps across multiple data centers while accepting the cost of additional efforts to implement the same features and maintaining similar capabilities.
Don't take it personally that I have used the word only while describing your data set. Yes, it's not big - my company stores more than 20 TB these days... so yeah, 100GB is really not that much...
To stop everyone from pointing that I should compare some other features or point out some other differences between those two - it's just a rough, high level overview on the things I consider relevant to the problem, not a full comparison or analysis of the problem. But feel free to point out what I have missed and I will be happy to include new stuff in this answer...

Querying with Redis?

I've been learning Node.js so I decided to make a simple ad network, but I can't seem to decide on a database to use. I've been messing around with Redis but I can't seem to find a way to query the database by specific criteria, instead I can only get the value of a key or a list or set inside a key.
Am I missing something, or should I be using a more robust database like MongoDB?
I would recommend reading this tutorial about Redis in order to understand its concepts and data types. I also had trouble understanding why there is no querying support similar to other (No)SQL databases until I read a few articles and tried to test and compare Redis with other solutions. Maybe it isn't the right database for your use case: it is very fast and supports advanced data structures, but it lacks the querying that is crucial for you. If you are looking for a database which allows you to query your data, then you should try mongodb or maybe riak.
Redis is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
If you can (and it is easy to implement), you should use these primitives (strings, hashes, lists, sets and sorted sets). The main advantage of Redis is that it is lightning fast, but it is a rather primitive key-value store (though Redis is a little bit more advanced than that). This also means that it cannot be queried the way, for example, SQL can.
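Querying "by specific criteria" with such primitives usually means maintaining your own index sets and intersecting them. The sketch below imitates that in plain Python, with dicts and sets standing in for hashes and sets; it does not talk to a Redis server, and the key naming scheme and ad fields are assumptions. In Redis itself the corresponding operations would be along the lines of HSET, SADD and SINTER:

```python
from collections import defaultdict

class TinyKeyValue:
    """In-memory stand-in for a store that offers hashes and sets by key."""

    def __init__(self):
        self.hashes = {}               # key -> dict of fields
        self.sets = defaultdict(set)   # key -> set of members

    def add_ad(self, ad_id, country, category):
        self.hashes[f"ad:{ad_id}"] = {"country": country, "category": category}
        # Maintain one set per queryable attribute value (a manual index).
        self.sets[f"ads:country:{country}"].add(ad_id)
        self.sets[f"ads:category:{category}"].add(ad_id)

    def find_ads(self, country, category):
        """Intersect the index sets, then fetch each matching hash."""
        by_country = self.sets.get(f"ads:country:{country}", set())
        by_category = self.sets.get(f"ads:category:{category}", set())
        return {ad_id: self.hashes[f"ad:{ad_id}"]
                for ad_id in sorted(by_country & by_category)}


kv = TinyKeyValue()
kv.add_ad(1, "us", "sports")
kv.add_ad(2, "us", "travel")
kv.add_ad(3, "de", "sports")
```

Every criterion you want to filter on needs its own index, kept up to date on each write, which is the work a database with a query engine does for you.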
It would probably be easier to use a more advanced store, like for example Mongodb, which is a document-oriented database. The trade-off you make in this case is PERFORMANCE, but I believe you should only tackle that if that is becoming a problem, which it probably will not be because Mongodb is also pretty fast and has the advantage that it can be queried. I think it would be advisable to have proper indexes for your queries(read>write) to make it fast.
I think that the main answer comes from the data structure. Check this article about NoSQL Data Modelling, for me it was very helpful: NoSql Data Modelling.
A second good article about data modeling, which makes a comparison between SQL and NoSQL, is the following: The Relational model anti pattern.

Is NoSQL ideal to store stats?

I'm not terribly familiar with NoSQL systems, but I remember reading a while back that they are ideal to handle statistical data.
Since I'm about to start writing code that will record data like "how many users were registered on each day", I was thinking I could use this as an opportunity to learn more about NoSQL if it fits the bill.
If NoSQL is indeed ideal for this, could you provide me with some information as to why? And which specific systems are best suited for this particular need?
So, after the first answer, maybe it's helpful to clarify a bit more.
I currently have a PostgreSQL database from which I'll get the data. It will be very simple, and no calculations needed. For example, I'll just get a resultset with the amount of users registered each day for the past month (so it'll basically just be a set of value pairs for the date/users) and save that in another table/database.
Thanks!
It kind of depends on what sorts of analysis you are going to be doing on these stats. If you are going to be doing a lot of different operations (averaging, summing, joining...) you may find NoSQL solutions to be more of a pain than they are worth.
However, if you are storing stats mostly for a display purpose, or for very specific analysis routines, NoSQL solutions start to shine.
If your data is small enough, stick with a SQL solution, which will give the benefit of a full query engine to work with, but if you have lots of values (one value a day is nothing, even if you were running for a million years), and are worried about storage size and performance, NoSQL options once again may be worth it.
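For the small case described in the question (daily registration counts pulled from an existing SQL database and saved elsewhere), a plain query already does the job. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL; the table names, column names and sample rows are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, registered_at TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO users (registered_at) VALUES (?)",
    [("2015-05-05 09:15:00",), ("2015-05-05 17:40:00",), ("2015-05-06 08:00:00",)],
)

def registrations_per_day(connection):
    """Return [(day, count), ...] ordered by day."""
    query = """
        SELECT date(registered_at) AS day, COUNT(*) AS registrations
        FROM users
        GROUP BY day
        ORDER BY day
    """
    return connection.execute(query).fetchall()

daily = registrations_per_day(conn)

# Save the date/count pairs in a separate stats table.
conn.execute(
    "CREATE TABLE daily_registrations (day TEXT PRIMARY KEY, registrations INTEGER)"
)
conn.executemany("INSERT INTO daily_registrations VALUES (?, ?)", daily)
```

One row per day stays tiny for any realistic lifetime of the site, so the stats table can live next to the source data without a second storage system.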
If your data is semi-structured, take a look at CouchDB, which offers some rudimentary indexing and querying support, which could provide some basis for analysis routines. If you are storing individual values with very little structure, my best advice would be to take a look at Tokyo Cabinet and Tokyo Tyrant, which are absolutely incredible options for key-value storage.
NoSQL systems tend to optimize the case where data is stored frequently, but accessed infrequently. In the case of statistics, you might gather lots of data from a (social) site frequently in small bits, which is optimized for. But retrieval and analysis might be slower... It of course depends on which "NoSql" System you decide to use.
