Transactions with the Cassandra data model

According to the CAP theorem, Cassandra can only offer eventual consistency. To make things worse, if we have multiple reads and writes within one request without proper handling, we may even lose logical consistency. In other words, if we do things fast, we may do them wrong.
Meanwhile, the best practice for designing a Cassandra data model is to think about the queries we are going to run and then add a column family (CF) for each of them. That way, adding or updating one entity often means updating many views/CFs. Without an atomic transaction feature, it's hard to do this correctly. But with such a feature, we would lose the A and P parts of CAP again.
This doesn't seem to concern many people, and I wonder why.
Is it because we can always find a way to design our data model so that we avoid doing multiple reads and writes in one session?
Is it because we can just ignore the 'doing it right' part?
In real practice, do we always end up with ACID-like features somewhere in the middle, e.g. implemented in the application layer or in a middleware layer that handles it?

It does concern people, but presumably you are using Cassandra because a single database server is unable to meet your needs due to scaling or reliability concerns. Because of this, you are forced to work around the limitations of a distributed system.
In real practice, do we always end up with ACID-like features somewhere in the middle, e.g. implemented in the application layer or in a middleware layer that handles it?
No, you don't usually have ACID somewhere else, because presumably that somewhere else would have to be distributed over multiple machines as well. Instead, you design your application around the limitations of a distributed system.
If you are updating multiple columns to satisfy your queries, you can look at the "eventually atomic" section in this presentation for ideas on how to do that. Basically, you write enough information about your update to Cassandra before you perform the write itself. That way, if the write fails, you can retry it later.
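Roughly, a sketch of that pattern with the DataStax Python driver might look like this; the keyspace, the update_log table, and the two user tables below are assumptions made up purely for illustration:

    import json
    import uuid

    from cassandra.cluster import Cluster

    # Assumed keyspace and tables (all hypothetical, for illustration only):
    #   update_log(log_id uuid PRIMARY KEY, payload text, applied boolean)
    #   user_by_id(id uuid PRIMARY KEY, email text)
    #   user_by_email(email text PRIMARY KEY, id uuid)
    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    def update_user_email(user_id, new_email):
        log_id = uuid.uuid4()
        payload = json.dumps({"id": str(user_id), "email": new_email})

        # 1. Record the intended update first, so a background job can replay
        #    it later if anything below fails.
        session.execute(
            "INSERT INTO update_log (log_id, payload, applied) VALUES (%s, %s, false)",
            (log_id, payload),
        )

        # 2. Apply the update to every table that serves a query on this entity.
        session.execute(
            "UPDATE user_by_id SET email = %s WHERE id = %s", (new_email, user_id)
        )
        session.execute(
            "INSERT INTO user_by_email (email, id) VALUES (%s, %s)", (new_email, user_id)
        )

        # 3. Mark the logged update as applied.
        session.execute(
            "UPDATE update_log SET applied = true WHERE log_id = %s", (log_id,)
        )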
If you can structure your application in such a way, using a coordination service like ZooKeeper or Cages may be useful.
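If you do go the ZooKeeper route, the coordination recipe can be fairly small. A hedged sketch with the kazoo client, where the ZooKeeper address, the lock path, and the critical section are all placeholders:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder address
    zk.start()

    # Serialize conflicting read-modify-write cycles on one entity behind a
    # distributed lock, so only one client mutates it at a time.
    lock = zk.Lock("/locks/user-42", "worker-1")
    with lock:
        # read from Cassandra, compute, write back
        pass

    zk.stop()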

Is Cassandra just a storage engine?

I've been evaluating Cassandra to replace MySQL in our microservices environment, because MySQL is the only portion of the infrastructure that is not distributed. Our needs are both write- and read-intensive, as it's a platform for exchanging raw data; a type of "bus", for lack of a better description. Our SELECTs are fairly simple and should remain that way, but I'm already struggling to get past some basic filtering due to the extreme limitations of SELECT queries.
For example, if I need to filter data, the fields I filter on have to be part of the key. At that point I can't change the data in those fields, because they're part of the key. I can use a SASI index, but then I hit a wall if I need to filter by more than one field. The hope was that materialized views would help with this, but in another post I was told to avoid them due to some instability and problematic behavior.
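To make what I'm running into concrete, here is a rough sketch with the Python driver; the keyspace, the readings table, and the values are made up:

    import datetime
    import uuid

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # assumed keyspace

    # Hypothetical table:
    #   CREATE TABLE readings (sensor_id uuid, ts timestamp, kind text,
    #                          value double, PRIMARY KEY (sensor_id, ts));

    sensor_id = uuid.UUID("00000000-0000-0000-0000-000000000001")
    since = datetime.datetime(2020, 1, 1)

    # Fine: restricted by the partition key plus a clustering column.
    rows = session.execute(
        "SELECT * FROM readings WHERE sensor_id = %s AND ts > %s",
        (sensor_id, since),
    )

    # Rejected unless I add an index per field or append ALLOW FILTERING,
    # which scans the whole table:
    #   SELECT * FROM readings WHERE kind = 'temp' AND value > 10.0;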
It would seem that Cassandra is good at storage but, realistically, not good as a standalone database platform for non-trivial applications that need anything beyond very basic filtering (i.e. on a single field). I'm guessing I'll have to accept the use of another front end like Elasticsearch, Solr, etc. The other option might be to accept the idea of filtering data within application logic, which is doable as long as the data sets coming back remain small enough.
Apache Cassandra is far more than just a storage engine. It is designed as a distributed database oriented towards high availability and partition tolerance, which can limit query capability if you want good and reliable performance.
It has a query language, CQL, which is quite powerful but deliberately limited in ways that guide users towards effective queries. In order to use it effectively you need to model your tables around your queries.
More often than not, you need to query your data in multiple ways, so users will often denormalize their data into multiple tables. Materialized views aim to make that experience better, but they have had their share of bugs and limitations, as you indicated. At this point, if you consider using them you should be aware of their limitations, although that is generally a good idea when evaluating anything.
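As a small, hypothetical illustration of that denormalization: the same user data keyed two different ways, one table per query (the table names and keyspace are made up):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # assumed keyspace

    # Lookup by id and lookup by email each get their own table, so every
    # query is answered from a single partition.
    session.execute("""
        CREATE TABLE IF NOT EXISTS users_by_id (
            id uuid PRIMARY KEY,
            email text,
            name text
        )
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS users_by_email (
            email text PRIMARY KEY,
            id uuid,
            name text
        )
    """)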
If you need advanced querying capabilities or do not have ahead-of-time knowledge of what the queries will be, Cassandra may not be a good fit. You can build these capabilities using products like Spark and Solr on top of Cassandra (which is what DataStax Enterprise does), but it may be difficult to achieve using Cassandra alone.
On the other hand there are many use cases where Cassandra is a great fit, such as messaging, personalization, sensor data, and so on.

Understanding Cassandra - can it replace RDBMS? [closed]

I've spent the last week cramming on Cassandra, trying to understand the basics and whether or not it fits our needs. I think I understand it on a basic level at this point, but if it works the way I believe I'm being told it does... I just can't tell whether it's a good fit.
We have a microservices platform which is essentially a large data bus between our customers. They use a set of APIs to push and pull shared data. The filtering, thus far, is pretty simple...but there's no way to know what the future may bring.
On top of this platform is an analytics layer with several visualizations (bar charts, graphs, etc.) based on the data being passed around.
The microservices platform was built atop MySQL with the idea that we could use clustering, which we honestly did not have a lot of luck with. On top of that, changes are painful, as is par for the course in the RDBMS world. Also, we expect extraordinary amounts of data with thousands-upon-thousands of concurrent users - it seems that we'll have an inevitable scaling problem.
So, we began looking at Cassandra as a potential distributed NoSQL replacement.
I watched the DataStax videos, took a course on another site, and started digging in. What I'm finding is:
Data is stored redundantly across several tables, each of which uses different primary and clustering keys to enable different types of queries, since rows are scattered across different nodes in the cluster.
Rather than joining, which isn't supported, you'd denormalize and create "wide" tables with tons of columns.
Data is eventually consistent, so new writes may not be readable within a predictable, reasonable amount of time.
CQL, while SQL-like, is mostly a lie. How you store and key data determines which kinds of queries you can run. It seems very limited and inflexible.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs. If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
I want to like this idea and love the distributed features, but frankly am mostly scared off, at this point. I feel like I've learned a lot and nothing at all, in the last week, and am entirely unsure how to proceed.
I looked into JanusGraph, Elassandra, etc. to see whether they would provide a simpler interface on top of Cassandra, relegating it to basically a storage engine, but I'm not confident many of these things are mature enough, or even appropriate, for what we need.
I suppose I'm looking for direction and insight from those of you who have built things w/ Cassandra, to see if it's a fit for what we're doing. I'm out of R&D time, unfortunately. Thanks!
Understanding Cassandra - can it replace RDBMS?
The short answer here is "no." Cassandra is not a simple drop-in replacement for an RDBMS when you suddenly need it to scale.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs.
It fits long-term database needs if you're applying it to the right use case.
DISCLAIMER: I am a bit of a Cassandra zealot. I've used it for a while, made minor contributions to the project, been named a "Cassandra MVP," and even co-authored a book about it. I think it's a great piece of tech, and you can do amazing things with it.
That being said, there are a lot of things that it's just not good at:
Query flexibility. The tradeoff you make for spreading rows across multiple nodes to meet operational scale is that you have to know your query patterns ahead of time, and then follow them strictly. The idea is that you want each query to be served by a single node, and you'll have to put some thought into your data model to achieve that. Unbound queries (SELECTs without WHERE clauses) become the enemy. (There is a small sketch of this after these points.)
Updating data in-place. Plan on storing values by a key, but then updating them a lot (ex: status)? Cassandra is not a good fit for that. This is because Cassandra has a log-based storage engine which doesn't overwrite anything...it just obsoletes it. So your previous values are still there, and still take up space and compute resources.
Deleting Data. Deleting data in the distributed database world is tricky. After all, how do you replicate nothing to another node? Cassandra's answer to that problem, is to use a structure called a tombstone. Tombstones take up space, can slow performance, and need to stay around long enough to replicate (making their removal tricky).
Maintaining Data Consistency. Being highly available and partition-tolerant, Cassandra embraces the concept of "eventual consistency." So it should come as no surprise that it really wasn't designed to be consistent. It has a lot of mechanisms which help keep data consistent, but they are far from perfect. Plus, there really isn't a way to know for sure whether your data is in sync or not.
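To illustrate the query-flexibility point above with a hedged sketch (the keyspace, table, and values are made up): if the known query is "all events for a user on a given day," then the user and the day both go into the partition key, so the whole query is served from one partition.

    import datetime
    import uuid

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # assumed keyspace

    # Composite partition key (user_id, day) keeps each query on one partition;
    # clustering by ts returns the events in time order.
    session.execute("""
        CREATE TABLE IF NOT EXISTS events_by_user_day (
            user_id uuid,
            day date,
            ts timestamp,
            payload text,
            PRIMARY KEY ((user_id, day), ts)
        )
    """)

    user_id = uuid.UUID("00000000-0000-0000-0000-000000000001")
    day = datetime.date(2020, 1, 1)

    rows = session.execute(
        "SELECT ts, payload FROM events_by_user_day WHERE user_id = %s AND day = %s",
        (user_id, day),
    )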
If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
Materialized views are something that I'd continue to stay away from for the foreseeable future. They're "experimental" for a reason. Basically, once they're out of sync, the only way to get them back in sync is to rebuild them.
I coach my dev teams on keeping their query tables (tables containing the same data, just keyed differently) in sync with BATCH statements. In fact, BATCH is a bit of a misnomer, as it probably should have been named "ATOMIC" instead. Because of its name, it is heavily misused, and that misuse can lead to problems. But it does apply mutations atomically, so that does help.
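For reference, a logged batch with the Python driver looks roughly like this; the users_by_id and users_by_email query tables here are hypothetical:

    import uuid

    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # assumed keyspace

    user_id = uuid.uuid4()
    email = "someone@example.com"

    # A logged (default) batch: either all mutations are eventually applied,
    # or none are. It exists for atomicity across query tables, not for
    # bulk-load performance.
    batch = BatchStatement()
    batch.add(
        SimpleStatement("INSERT INTO users_by_id (id, email) VALUES (%s, %s)"),
        (user_id, email),
    )
    batch.add(
        SimpleStatement("INSERT INTO users_by_email (email, id) VALUES (%s, %s)"),
        (email, user_id),
    )
    session.execute(batch)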
Basically, scrutinize your database requirements. If Cassandra doesn't cut it, then try to find one which does. CockroachDB (or one of the other NewSQLs) might be a better fit for what you're talking about. It tries to be a drop-in for Postgres, and it scales with some Cassandra-like mechanisms, so it might be worth looking into.
Cassandra is very good at what it does but it is not a drop-in replacement for an RDBMS. If you find that you need any of the following, I would not encourage you to migrate to Cassandra:
Strict consistency
ACID transactions
Support for ad-hoc queries, including joins, aggregates, etc.
Now as for you hitting some limits (or thinking you will hit them in the future) with MySQL, here are some thoughts:
Don't think that a limitation in MySQL is a limitation of RDBMSs in general. Just so you don't think I am a $some_other_DB zealot, I've been using MySQL for almost 20 years, but it is not the best tool for all jobs.
If by 'changes' you mean 'schema changes', a lot of the pain can be alleviated by either:
Using an RDBMS where they are implemented better (including perhaps a more recent MySQL version)
Using community supported tools such as pt-online-schema-change or gh-ost
Good luck!

Atomic probabilistic counting and set membership in Cassandra

I am looking to do probabilistic counting and set membership using structures such as bloom filters and hyperloglog.
Is there any support for using such data structures and performing operations on them atomically on the server-side, through user-defined functions or similar? Or any way for me to add extensions with such functionality?
(I could ingest the data through another system and batch the updates to reduce the contention, but it would be far simpler if all this could be handled in the database server.)
You have to implement them client-side. A common approach is to serialize and insert, every X minutes, the HLL you keep in memory on your system, and then merge the stored sketches on reads across the range you're interested in (maybe using an RRD-type approach for different periods beyond X minutes). This is not very durable, so depending on the use case it might mean something more complex.
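A rough sketch of that serialize-and-merge pattern, using the datasketch library's HyperLogLog together with the Python Cassandra driver; the keyspace, the hll_by_window table, the pickle serialization, and the flush interval are all assumptions:

    import datetime
    import pickle

    from cassandra.cluster import Cluster
    from datasketch import HyperLogLog

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # assumed keyspace

    # Hypothetical table:
    #   hll_by_window(metric text, window timestamp, sketch blob,
    #                 PRIMARY KEY (metric, window))

    hll = HyperLogLog()  # the sketch kept in memory on this application node

    def record(item: str) -> None:
        hll.update(item.encode("utf8"))

    def flush(metric: str, window: datetime.datetime) -> None:
        # Every X minutes, serialize the in-memory sketch and write it out.
        session.execute(
            "INSERT INTO hll_by_window (metric, window, sketch) VALUES (%s, %s, %s)",
            (metric, window, pickle.dumps(hll)),
        )

    def approx_count(metric, start, end) -> float:
        # On read, merge the stored sketches covering the window of interest.
        merged = HyperLogLog()
        rows = session.execute(
            "SELECT sketch FROM hll_by_window"
            " WHERE metric = %s AND window >= %s AND window <= %s",
            (metric, start, end),
        )
        for row in rows:
            merged.merge(pickle.loads(row.sketch))
        return merged.count()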
Although it seems a close fit for C*, I think one of the big issues is deletes, but you can probably work around them. There's a proof of concept for a C*-side implementation here:
http://vilkeliskis.com/blog/2013/12/28/hacking_cassandra.html
that you can likely get working "well enough". https://issues.apache.org/jira/browse/CASSANDRA-8861 may be something to watch.

Scratch a CouchDB document

Is it possible to "scratch" a CouchDB document? By that I mean delete a document and make sure that the document and its history are completely removed from the database.
I do not want to perform a database compaction; I just want to fully wipe out a single document. I am looking for a solution that guarantees there is no trace of the document in the database, without needing to wait for internal database processes to eventually remove it.
(A Python solution is appreciated.)
When you delete a document in CouchDB, generally only the _id, _rev, and a deleted flag are preserved. These are preserved to allow for eventual consistency through replication. Forcing an immediate delete across an entire group of nodes isn't really consistent with the architecture.
The closest thing would be to use purge; once you do that, all traces of the doc will be gone after the next compaction. I realize this isn't exactly what you're asking for, but it's the closest thing off the top of my head.
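Since you mentioned Python: purge is exposed through CouchDB's _purge endpoint, so something along these lines should work on a version where purge is available (the URL, credentials, database name, and document id below are made up):

    import requests

    COUCH = "http://admin:password@127.0.0.1:5984"  # assumed URL and credentials
    db = "mydb"
    doc_id = "some-doc"

    # Look up the document's current (leaf) revision...
    rev = requests.get(f"{COUCH}/{db}/{doc_id}").json()["_rev"]

    # ...and purge it. Unlike a normal DELETE, purge leaves no tombstone behind;
    # whatever traces remain disappear at the next compaction.
    resp = requests.post(f"{COUCH}/{db}/_purge", json={doc_id: [rev]})
    resp.raise_for_status()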
Here's a nice article explaining the basis behind the various delete methods available.
Deleting anything from a file system "for sure" is difficult, and usually quite an expensive problem; even more so with databases in general. Depending on what "for sure" means to you, you may end up needing a custom database, a custom OS, and custom hardware. It is a bit like saying "I want a fault-tolerant system": everyone would like to have one, but only a few can afford it, and the good news is that most can settle for less. The same goes for deleting "for sure". I assume you are trying to address some security or privacy issue, so try to see if there is another way to get what you need, for example by encrypting the document or its sensitive parts.

Space efficient embedded Haskell persistence solution

I'm looking for a persistence solution (maybe a NoSQL db? or something else...) that has the following criteria:
1) Has a Haskell API
2) Is disk space efficient--the db could easily get to many gigabytes of data but I need it to run well on a typical desktop. I need something that stores the data as efficiently as possible. So, for example, storing field names in a record would be bad.
3) High performance for reading sequential records. The typical use case is start somewhere and then read forward straight through the data--reading through possibly millions of records as quickly as possible.
4) Data is basically never changed (would only be changed if it was discovered data was incorrect somehow), just logged
5) It should act directly on file(s) that can be easily moved/copied around. It should not be calling a separate running server.
If you drop the requirement of a single file with no other running process, everything else can be fulfilled by just about every standard RDBMS and, depending on the type of data, sometimes especially well by columnar stores.
The only single-file solution I know of is SQLite. SQLite mainly struggles when a single DB needs to be accessed by multiple concurrent processes. If that isn't the case, then I wouldn't be surprised if you could scale it up significantly.
Additionally, if you're only looking for sequential scans and key-value stores, you could just go with berkeleydb, which is known to be high-performance for very large data sets.
There are high quality Haskell bindings for talking to both sqlite and berkeleydb.
Edit: For sequential access only, it's also blindingly straightforward to roll your own layer with the binary or cereal packages -- you basically need to write a helper function to wrap reading records from a file sequentially rather than all at once. An abstraction for folding over them is nice as well. Then you can decide to append to a single file, or spread your writes across files as you go. Either way, that's the most lightweight and straightforward option of all. The only drawback is having to worry about durability -- safe writes in the presence of interrupts, and all the other stuff that a good DB solution should take care of for you.
CouchDB ticks most of your boxes:
1) http://hackage.haskell.org/package/CouchDB
2) Depends on how you use it. You can store any binary data in it, but it's up to you to know what it means. Or you can store XML or JSON, which is less space-efficient but easier to migrate as your schema evolves (which it will).
3) Don't know, but it's used for big web sites.
4) CouchDB uses a CM-like concept of updates and baselines, so old data stays around. It can be purged later as obsolete, but I think that's optional.
5) No. It's written in Erlang and runs (I believe) as a separate process. But why is that a problem?

Resources