Can Druid replace Cassandra? [closed]

I can't help thinking that there aren't many use cases that Cassandra serves more effectively than Druid. Whether as a time series store or a key-value store, queries can be written in Druid to extract the data however it's needed.
The argument here is more around justifying Druid than Cassandra.
Apart from Cassandra's fast writes, is there really anything else? Especially given Druid's real-time aggregation and querying capabilities, does that not outweigh Cassandra?
For a more direct question that can actually be answered: doesn't Druid provide a superset of Cassandra's features, and wouldn't one be better off using Druid right away, for all use cases?

For a more direct question that can actually be answered: doesn't Druid provide a superset of Cassandra's features, and wouldn't one be better off using Druid right away, for all use cases?
Not at all; they are not comparable. We are talking about two very different technologies here. An easy way to see it: Cassandra is a distributed storage solution, whereas Druid is a distributed aggregator (i.e., an awesome open-source OLAP-like tool). The post you are referring to is, in my opinion, a bit misleading in the sense that it compares the two projects in the world of data mining, which is not Cassandra's focus.
Druid is not good at point lookups, at all. It loves time series, and its partitioning is mainly based on date-based segments (e.g. hourly/monthly segments, which may be further sharded based on size).
Druid pre-aggregates your data based on pre-defined aggregators, which are numeric (e.g. sum the number of click events on your website at a daily granularity). If you want to store a lookup from one string key to, say, another string or an exact number, Druid is the worst solution you could pick.
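To make that concrete, here is a minimal sketch of a Druid pre-aggregation query using Druid's native JSON query API over HTTP; the broker URL, datasource, and column names are assumptions for illustration, not anything from the original posts:

```python
import requests

# Assumed broker URL; 8082 is Druid's default broker port.
BROKER_URL = "http://localhost:8082/druid/v2/"

query = {
    "queryType": "timeseries",
    "dataSource": "website_events",            # hypothetical datasource
    "granularity": "day",
    "intervals": ["2021-01-01/2021-02-01"],
    "aggregations": [
        # Sum a numeric column that Druid rolled up at ingestion time.
        {"type": "longSum", "name": "clicks", "fieldName": "click_count"}
    ],
}

resp = requests.post(BROKER_URL, json=query, timeout=30)
resp.raise_for_status()
for row in resp.json():
    # One result bucket per day in the queried interval.
    print(row["timestamp"], row["result"]["clicks"])
```

Note what's absent: there is no way to ask "give me the exact value stored under key X", which is the point lookup Druid is bad at.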

Not sure this is really a Stack Overflow type of question, but the easy answer is that it's a matter of use case. Simply put, Druid shines when it facilitates very fast ad-hoc queries against data that has been ingested in real time. It's read-consistent now, and you are not limited to pre-computed queries to get speed. On the other hand, you can't update the data it holds; you can only overwrite it.
Cassandra (from what I've read; I haven't used it) is more of an eventually consistent data store that supports writes and does very nicely with pre-compute. It's not intended to continuously ingest data while providing real-time access to ad-hoc queries against that same data.
In fact, the two could work together, as has been proposed on planetcassandra.org in "Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine!".

It depends on the use case. For example, I was using Cassandra for aggregation purposes, i.e. stats like the aggregated number of domains with respect to users, departments, etc., and event trends (bandwidth, users, apps, and so on) with configurable time windows. Replacing Cassandra with Druid worked out very well for me, because Druid is super efficient at aggregations. On the other hand, if you need time series data with eventual consistency, where you can get the details of individual events, Cassandra is better.
A combination of Druid and Elasticsearch worked out very well for removing Cassandra from our big data infrastructure.

Related

rdb vs key-value store for django functionality [duplicate]

When would one choose a key-value data store over a relational DB? What considerations go into deciding one or the other? When is a mix of both the best route? Please provide examples if you can.
Key-value, hierarchical, map-reduce, or graph database systems are much closer to implementation strategies; they are heavily tied to the physical representation. The primary reason to choose one of these is a compelling performance argument, where it fits your data processing strategy very closely. Beware: ad-hoc queries are usually not practical for these systems, and you're better off deciding on your queries ahead of time.
Relational database systems try to separate the logical, business-oriented model from the underlying physical representation and processing strategies. This separation is imperfect, but still quite good. Relational systems are great for handling facts and extracting reliable information from collections of facts. Relational systems are also great at ad-hoc queries, which the other systems are notoriously bad at. That's a great fit in the business world and many other places. That's why relational systems are so prevalent.
If it's a business application, a relational system is almost always the answer. For other applications, it's probably the answer too. If you have more of a data processing problem, like some pipeline of things that need to happen, and you have massive amounts of data, and you know all of your queries up front, another system may be right for you.
If your data is simply a list of things and you can derive a unique identifier for each item, then a KVS is a good match. They are close implementations of the simple data structures we learned in freshman computer science and do not allow for complex relationships.
A simple test: can you represent your data and all of its relationships as a linked list or hash table? If yes, a KVS may work. If no, you need an RDB.
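As an illustration of that test, here is a hedged Python sketch; the keys and fields are invented for the example:

```python
# Fits a KV store: every item is self-contained under one unique key.
sessions = {
    "sess:9f2c": {"user": "alice", "expires": 1700000000},
    "sess:1b7d": {"user": "bob",   "expires": 1700003600},
}
print(sessions["sess:9f2c"])   # one key, one lookup, no joins needed

# Does NOT fit: "which orders contain product p2?" traverses the
# relationship in the other direction, so a plain hash table forces a
# full scan (or a hand-maintained inverted index) -- a job for an RDB.
orders = {
    "order:1": {"user": "alice", "products": ["p1", "p2"]},
    "order:2": {"user": "bob",   "products": ["p2"]},
}
hits = [k for k, o in orders.items() if "p2" in o["products"]]  # O(n)
print(hits)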
You still need to find a KVS that will work in your environment. Support for KVSes, even the major ones, is nowhere near what it is for, say, PostgreSQL and MySQL/MariaDB.
IMO, key-value pair stores (e.g. NoSQL databases) work best when the underlying data is unstructured, unpredictable, or changing often. If you don't have structured data, a relational database is going to be more trouble than it's worth, because you will need to make lots of schema changes and/or jump through hoops to conform your data to the structure.
KVP / JSON / NoSQL is great because changes to the data structure do not require completely refactoring the data model. Adding a field to your data object is simply a matter of adding it to the data. The other side of the coin is that there are fewer constraints and validation checks in a KVP / NoSQL database than in a relational database, so your data might get messy.
There are performance and space-saving benefits to relational data models. Normalized relational data can make understanding and validating the data easier, because there are table key relationships and constraints to help you out.
One of the worst patterns I've seen is trying to have it both ways: stuffing key-value pairs into a relational database is often a recipe for disaster. I would recommend using the technology that suits your data foremost.
If you want O(1) lookups of values based on keys, then you want a KV store. Meaning, if you have data of the form k1={foo}, k2={bar}, etc, even when the values are larger/ nested structures, and want fast lookups, you want a KV store.
Even with proper indexing, you cannot achieve O(1) lookups in a relational DB for arbitrary keys. Sometimes this is referred to as "random lookups".
Alternatively stated: if you only ever query by one column, a "primary key" if you will, to retrieve the rest of the data, then using that column as a keyspace and the rest of the data as a value in a KV store is the most efficient way to do lookups.
In contrast, if you often query the data by any of several columns, aka you support a richer query API for the data, then you may want a relational database.
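A minimal sketch of that lookup-by-primary-key pattern, assuming a local Redis instance and the redis-py client; the key and record shape are illustrative:

```python
import json

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis

# Store the whole record as an opaque value under its primary key.
r.set("user:42", json.dumps({"name": "alice", "plan": "pro"}))

# O(1) retrieval by key -- the only access path the store must support.
user = json.loads(r.get("user:42"))
print(user["plan"])

# What the KV store will NOT answer efficiently: "all users on the
# 'pro' plan". That second access path is where a relational DB earns
# its keep.
```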
A traditional relational database has problems scaling beyond a point. Where that point is depends a bit on what you are trying to do.
All (most?) of the suppliers of cloud computing are providing key-value data stores.
However, if you have a reasonably sized application with a complicated data structure, then the support that you get from using a relational database can reduce your development costs.
In my experience, if you're even asking the question whether to use traditional vs esoteric practices, then go traditional. While esoteric practices are sexy, challenging, and fun, 99.999% of applications call for a traditional approach.
With regards to relational vs KV, the question you should be asking is:
Why would I not want to use a relational model for this scenario: ...
Since you have not described the scenario, it's impossible for anyone to tell you why you shouldn't use it. The "catch all" reason for KV is scalability, which isn't a problem now. Do you know the rules of optimization?
Don't do it.
(for experts only) Don't do it now.
KV is a highly optimized solution to scalability that will most likely be completely unnecessary for your application.

Understanding Cassandra - can it replace RDBMS? [closed]

I've spent the last week cramming on Cassandra, trying to understand the basics, as well as if it fits our needs, or not. I think I understand it on a basic level at this point, but if it works like I believe I'm being told...I just can't tell if it's a good fit.
We have a microservices platform which is essentially a large data bus between our customers. They use a set of APIs to push and pull shared data. The filtering, thus far, is pretty simple...but there's no way to know what the future may bring.
On top of this platform is an analytics layer with several visualizations (bar charts, graphs, etc.) based on the data being passed around.
The microservices platform was built atop MySQL with the idea that we could use clustering, which we honestly did not have a lot of luck with. On top of that, changes are painful, as is par for the course in the RDBMS world. Also, we expect extraordinary amounts of data with thousands-upon-thousands of concurrent users - it seems that we'll have an inevitable scaling problem.
So, we began looking at Cassandra as a distributed nosql potential replacement.
I watched the DataStax videos, took a course on another site, and started digging in. What I'm finding is:
Data is stored redundantly across several tables, each of which uses different primary and clustering keys to enable different types of queries, since rows are scattered across different nodes in the cluster
Rather than joining, which isn't supported, you'd denormalize and create "wide" tables with tons of columns
Data is eventually consistent, so new writes may not be readily readable in a predictable, reasonable amount of time.
CQL, while SQL-like, is mostly a lie. How you store and key data determines which types of queries you can use. It seems very limited and inflexible.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs. If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
I want to like this idea and love the distributed features, but frankly am mostly scared off, at this point. I feel like I've learned a lot and nothing at all, in the last week, and am entirely unsure how to proceed.
I looked into JanusGraph, Elassandra, etc. to see if that would provide a simpler interface on top of Cassandra, relegating it to basically a storage engine, but am not confident many of these things are mature enough or even proper, for what we need.
I suppose I'm looking for direction and insight from those of you who have built things w/ Cassandra, to see if it's a fit for what we're doing. I'm out of R&D time, unfortunately. Thanks!
Understanding Cassandra - can it replace RDBMS?
The short answer here is "NO." Cassandra is not a simple drop-in replacement for an RDBMS when you suddenly need it to scale.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs.
It fits long-term database needs if you're applying it to the right use case.
DISCLAIMER: I am a bit of a Cassandra zealot. I've used it for a while, made minor contributions to the project, been named a "Cassandra MVP," and even co-authored a book about it. I think it's a great piece of tech, and you can do amazing things with it.
That being said, there are a lot of things that it's just not good at:
Query flexibility. The tradeoff you make for spreading rows across multiple nodes to meet operational scale is that you have to know your query patterns ahead of time, and then follow them strictly. The idea is that you want every query to be served by a single node, and you'll have to put some thought into your data model to achieve that (see the sketch after this list). Unbound queries (SELECTs without WHERE clauses) become the enemy.
Updating data in-place. Plan on storing values by a key, but then updating them a lot (ex: status)? Cassandra is not a good fit for that. This is because Cassandra has a log-based storage engine which doesn't overwrite anything...it just obsoletes it. So your previous values are still there, and still take up space and compute resources.
Deleting Data. Deleting data in the distributed database world is tricky. After all, how do you replicate nothing to another node? Cassandra's answer to that problem, is to use a structure called a tombstone. Tombstones take up space, can slow performance, and need to stay around long enough to replicate (making their removal tricky).
Maintaining Data Consistency. Being highly-available and partition tolerant, Cassandra embraces the concept of "eventual consistency." So it should come as no surprise that it really wasn't designed to be consistent. It has a lot of mechanisms which will help keep data consistent, but they are far from perfect. Plus, there really isn't a way to know for sure if your data is in sync or not.
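To illustrate the first point above, here is a hedged sketch of the "one table per query pattern" approach using the DataStax Python driver; the keyspace, table, and column names are assumptions for the example:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("shop")  # assumed keyspace

# The same order data, keyed two ways -- one table per query pattern,
# so each SELECT is answered from a single partition on a single node.
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_customer (
        customer_id uuid, order_date timestamp, order_id uuid, total decimal,
        PRIMARY KEY (customer_id, order_date)
    ) WITH CLUSTERING ORDER BY (order_date DESC)
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_id (
        order_id uuid PRIMARY KEY,
        customer_id uuid, order_date timestamp, total decimal
    )
""")

# "orders for a customer, newest first" -> orders_by_customer
# "look up one order by its id"         -> orders_by_id
# Neither query is unbound; both name their full partition key.
```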
If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
Materialized views are something that I'd continue to stay away from for the foreseeable future. They're "experimental" for a reason. Basically, once they're out of sync, the only way to get them back in sync is to rebuild them.
I coach my dev teams on keeping their query tables (tables containing the same data, just keyed differently) in sync with BATCH statements. In fact, BATCH is a misnomer; it probably should have been named "ATOMIC" instead. Because of its name it is heavily misused, and its misuse can lead to problems. But it does apply mutations atomically, so that does help.
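A minimal sketch of that approach with the Python driver's BatchStatement, reusing the hypothetical order tables from the sketch above:

```python
import uuid
from datetime import datetime, timezone
from decimal import Decimal

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(["127.0.0.1"]).connect("shop")

ins_by_customer = session.prepare(
    "INSERT INTO orders_by_customer (customer_id, order_date, order_id, total)"
    " VALUES (?, ?, ?, ?)")
ins_by_id = session.prepare(
    "INSERT INTO orders_by_id (order_id, customer_id, order_date, total)"
    " VALUES (?, ?, ?, ?)")

order_id, cust_id = uuid.uuid4(), uuid.uuid4()
when, total = datetime.now(timezone.utc), Decimal("19.99")

# A LOGGED batch (the default) applies all mutations or none of them,
# which keeps the two query tables from silently drifting apart.
batch = BatchStatement()
batch.add(ins_by_customer, (cust_id, when, order_id, total))
batch.add(ins_by_id, (order_id, cust_id, when, total))
session.execute(batch)
```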
Basically, scrutinize your database requirements. If Cassandra doesn't cut it, then try to find one which does. CockroachDB (or one of the other NewSQLs) might be a better fit for what you're talking about. It tries to be a drop-in for Postgres, and it scales with some Cassandra-like mechanisms, so it might be worth looking into.
Cassandra is very good at what it does but it is not a drop-in replacement for an RDBMS. If you find that you need any of the following, I would not encourage you to migrate to Cassandra:
Strict consistency
ACID transactions
Support for ad-hoc queries, including joins, aggregates, etc.
Now as for you hitting some limits (or thinking you will hit them in the future) with MySQL, here are some thoughts:
Don't think that a limitation of MySQL is a limitation of RDBMSs in general. And just so you don't think I am a $some_other_DB zealot: I've been using MySQL for almost 20 years, but it is not the best tool for all jobs.
If by 'changes' you mean 'schema changes', a lot of the pain can be alleviated by either:
Using an RDBMS where they are implemented better (including perhaps a more recent MySQL version)
Using community supported tools such as pt-online-schema-change or gh-ost
Good luck!

Query-Driven Modelling and Big Data

I was watching one of the Cassandra videos on DataStax Academy. One concept they talk a lot about is query-driven modelling. This makes sense when you know your queries upfront, like in the KillrVideo example.
However, in big data cases, I hope I am not the only one to think that we barely know what kind of queries analysts will perform on the data 5 months or one year down the road.
If this is the case, what are the best practices for storing your data? My guess is that for advanced querying of such data, you likely will end up loading your data into Spark. But what do I have to consider at storage time to avoid operational troubles and troubles at retrieval time? What retrieval approaches are less problematic?
Cassandra is also a database for analytics use cases, but not always for ad-hoc analytics (the "only one report, and this query will never be run again" kind of thing).
For those use cases, a Hadoop cluster is a better option for you (maybe Parquet on Hadoop). If you see that queries will be run over and over again, Cassandra is your friend. Generally you can use Cassandra for 50 to 70% of your use cases. With column keys and secondary indices you can serve a really wide spectrum of queries. Go to your analytics guys and ask them what they need. Then: create your tables :)
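For illustration, a hedged sketch of that kind of table-plus-secondary-index setup with the DataStax Python driver; the keyspace, table, and column names are invented:

```python
import datetime

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")  # assumed keyspace

# Clustering on ts gives time-range queries within each day's partition...
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        day date, ts timestamp, app text, bytes bigint,
        PRIMARY KEY (day, ts)
    )
""")
# ...and a secondary index adds one more access path (use them sparingly).
session.execute("CREATE INDEX IF NOT EXISTS ON events (app)")

# The partition key is fully restricted, so the indexed filter is cheap.
rows = session.execute(
    "SELECT ts, app, bytes FROM events WHERE day = %s AND app = %s",
    (datetime.date(2021, 6, 1), "webmail"))
for row in rows:
    print(row.ts, row.app, row.bytes)
```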
DataStax has a course on doing analysis on Cassandra with Apache Spark.

Querying big data [closed]

I am working with a system that takes in a write stream of 50 writes per second at 10 KB each, running 24 hours a day. The data is ingested via a messaging system into a SQL database, and then used in an overnight aggregation that takes around 15 hours to produce queryable data for an application.
This is currently all in sql, but we are moving to a new architecture.
The plan is to move the ingested writes into a distributed database like Cassandra or DynamoDB, and then perform the aggregation in Hadoop. This makes those parts of the system scalable.
My question is: when people have this architecture, where do they put the data after the writes and aggregation have been performed, so that it can be queried?
In more detail:
The query model our application uses is quite complicated. To make the data queryable in Cassandra, we would have to denormalise it for all queries; this is possible, but would mean a massive growth in data size. Is this normal practice? Or would you prefer to move the data back into SQL?
We could move the data into Redshift, but that seems to be more for ad-hoc data analytics; its purpose is not to be the backend for a data analytics application. I also think the queries are too complicated in their current form to be written in an ORM, which is what would be required for Redshift.
Does this mean that I still need to put the data into SQL Server?
I am looking for examples of what people are doing at the moment.
I am sorry this question is a bit abstract; please do not close it, as I will add more detail. I have read lots on big data, but most articles are about the ingestion of data using messaging / workers and distributed databases. I have not found any that show what they do with this ingested data and how it is queried from the application.
*Answer to JosefN's comment: Yes, we are not planning to denormalise into a SQL DB. The choice is: denormalise into Cassandra for all clients and queries, which would probably mean 100x the current data size, as there would be so much duplication in the denormalised model. The other option is to store it as it is now, so that it is queryable; but then, is my only option a SQL DB?
*After more research I have more information. The best options at the moment seem to be:
store back in SQL
denormalise in Cassandra
use one of the real-time SQL engines on top of Hadoop/HDFS, like Impala
DRPC with Storm
I do not have any experience with Impala or DRPC with Storm, so if anyone has any info on latency and the types of queries that can be performed with these, that would be great.
Please do not refer to documentation or blog posts, I know how these technologies work, I only want to know if someone has used them in production and has their own information on this subject. thanks
I would suggest moving the aggregated data into HDFS. Using Hive, which provides a relational view over data stored inside HDFS, you can very well use ad-hoc SQL-like queries. At the same time you will benefit from the parallelism of the MapReduce jobs that get invoked when you use Hive; this would help you decrease the query latencies you would have with an RDBMS. Also think about doing the aggregation jobs in Hadoop itself.
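As a sketch of what that looks like in practice, here is a hedged example using the PyHive client against a HiveServer2 instance; the host, table, and column names are assumptions:

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Assumed HiveServer2 host and an illustrative table of aggregated output.
conn = hive.Connection(host="hive-server", port=10000, username="etl")
cur = conn.cursor()

# Ad-hoc SQL over files in HDFS; Hive compiles this to MapReduce jobs,
# so you get parallelism at the cost of per-query startup latency.
cur.execute("""
    SELECT state, dt, SUM(amount) AS total
    FROM aggregated_sales
    WHERE dt >= '2014-01-01'
    GROUP BY state, dt
""")
for state, dt, total in cur.fetchall():
    print(state, dt, total)
```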
Since the data after aggregation is small, and you are looking for good latency, keeping it in HDFS and querying it using Hive is not preferable.
I have seen people use HBase to store aggregated data and query it, but as you mentioned earlier, you will have to denormalize the data. For this case I would recommend writing the aggregated data back to MySQL and querying it there, if the aggregated data is not big.
I think one traditional approach is to run your Hadoop/Hive jobs to aggregate across all possible dimensions, then store the results in a key/value store like HBase, and look them up at runtime with a key based on the aggregation done (i.e. /state=NJ/dt=20131225/). This can cause an explosion in size, especially if there are many dimensions to roll up.
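A minimal sketch of that lookup pattern, assuming an HBase Thrift gateway and the happybase client; the table name and column family are hypothetical:

```python
import happybase  # pip install happybase; talks to HBase's Thrift gateway

conn = happybase.Connection("hbase-thrift-host")  # assumed Thrift host
table = conn.table("rollups")                     # hypothetical table

# The batch job writes one pre-aggregated row per dimension combination,
# keyed exactly the way the application will look it up at runtime.
table.put(b"/state=NJ/dt=20131225/", {b"agg:total_sales": b"184233"})

# The runtime read is a single key lookup -- no scan, no re-aggregation.
row = table.row(b"/state=NJ/dt=20131225/")
print(row[b"agg:total_sales"])
```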
If you want or need a more realtime solution as well, take a look at Twitter's Summingbird.

Is Cassandra a good candidate for a database that must sustain over 100 read/write operations per second?

Currently our system uses PostgreSQL, however we seem to have pushed the limit of its capabilities. Some of our tables need to handle over 100 read/write operations per second so it is probably time to scale horizontally across multiple machines.
We have a lot of experience using GAE's Big Table. Big Table had rich options for querying; for example, queries were possible against list data fields. Cassandra is supposed to be based on Big Table, but if I understand correctly, with Cassandra we will actually have to custom-code a layer on top that uses and maintains index tables.
It would be great if there were an open source database available for which we did not have to build our own custom logic for maintaining index tables, zig-zag merge joins, etc.
Is Cassandra a good candidate here? Or are there ones that might be considered better?
Unless the operations are huge joins or return hundreds of thousands of rows, any database you choose will be able to sustain 100 ops/s. Cassandra will have no problems serving thousands if not tens of thousands of reads and writes per node.
Without knowing more about your particular use case, it's impossible to give you meaningful advice. Cassandra is a great database, but whether it's right for you I don't know. I'd suggest looking through the cassandra tag here on Stack Overflow, looking at what people ask about, seeing whether it looks at all like what you're trying to do, and checking whether the answers say it's possible with Cassandra (I know I've answered quite a few questions where the answer was that Cassandra wasn't the best choice for that particular case).
Cassandra and GAE Big Table have big similarities, but also big differences. One thing that trips up new Cassandra users is that there really isn't any way of doing things like "add this thing only unless that other thing was there" or "add an item and remove all but the last N items".
