I am looking to do probabilistic counting and set membership using structures such as bloom filters and hyperloglog.
Is there any support for using such data structures and performing operations on them atomically on the server-side, through user-defined functions or similar? Or any way for me to add extensions with such functionality?
(I could ingest the data through another system and batch the updates to reduce the contention, but it would be far simpler if all this could be handled in the database server.)
You have to implement them client-side. A common approach is to serialize and insert, every X minutes, the HLL you keep in memory on your system, and then merge the stored HLLs on reads across the range of interest (maybe using an RRD-type approach for periods beyond X minutes). This is not very durable, so depending on the use case it might mean something more complex.
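For illustration, here is a rough sketch of that pattern in Python; the hand-rolled HLL class, the register size, and the suggested bucket layout are my own assumptions, not an existing API:

```python
# Sketch of the client-side pattern: keep an HLL in memory, flush its
# registers to a time-bucket row every X minutes, merge buckets on read.
# This is a bare-bones HLL (no small/large-range corrections), for
# illustration only; a real deployment would use a proper library.
import hashlib


class HLL:
    def __init__(self, p=12):                  # 2**p registers (~4 KB)
        self.p, self.m = p, 1 << p
        self.registers = bytearray(self.m)

    def add(self, value: str):
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HLL"):
        for i, r in enumerate(other.registers): # register-wise max = union
            if r > self.registers[i]:
                self.registers[i] = r

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)


# Every X minutes: write bytes(hll.registers) into a row keyed by
# (metric, time_bucket). On read, fetch the buckets covering the range
# of interest, merge() them into one HLL, then call estimate().
```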
Although it seems a close fit for C*, I think one of the big issues is deletes, but you can probably work around them. There's a proof of concept for a C*-side implementation here:
http://vilkeliskis.com/blog/2013/12/28/hacking_cassandra.html
that you can likely get working "well enough". https://issues.apache.org/jira/browse/CASSANDRA-8861 may be something to watch.
I've spent the last week cramming on Cassandra, trying to understand the basics and whether or not it fits our needs. I think I understand it at a basic level at this point, but if it works the way I believe I'm being told it does...I just can't tell if it's a good fit.
We have a microservices platform which is essentially a large data bus between our customers. They use a set of APIs to push and pull shared data. The filtering, thus far, is pretty simple...but there's no way to know what the future may bring.
On top of this platform is an analytics layer with several visualizations (bar charts, graphs, etc.) based on the data being passed around.
The microservices platform was built atop MySQL with the idea that we could use clustering, which we honestly did not have a lot of luck with. On top of that, changes are painful, as is par for the course in the RDBMS world. Also, we expect extraordinary amounts of data with thousands-upon-thousands of concurrent users - it seems that we'll have an inevitable scaling problem.
So, we began looking at Cassandra as a potential distributed NoSQL replacement.
I watched the DataStax videos, took a course on another site, and started digging in. What I'm finding is:
Data is stored redundantly across several tables, each of which uses different primary and clustering keys, to enable different types of queries, since rows are scattered across different nodes in the cluster
Rather than joining, which isn't supported, you'd denormalize and create "wide" tables with tons of columns
Data is eventually consistent, so new writes may not be readily readable in a predictable, reasonable amount of time.
CQL, while SQL-like, is mostly a lie. How you store and key data determines which types of queries you can use. It seems very limited and inflexible.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs. If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
I want to like this idea and love the distributed features, but frankly am mostly scared off, at this point. I feel like I've learned a lot and nothing at all, in the last week, and am entirely unsure how to proceed.
I looked into JanusGraph, Elassandra, etc. to see if they would provide a simpler interface on top of Cassandra, relegating it to basically a storage engine, but I'm not confident many of these things are mature enough, or even appropriate, for what we need.
I suppose I'm looking for direction and insight from those of you who have built things w/ Cassandra, to see if it's a fit for what we're doing. I'm out of R&D time, unfortunately. Thanks!
Understanding Cassandra - can it replace RDBMS?
The short answer here is "no." Cassandra is not a simple drop-in replacement for an RDBMS when you suddenly need it to scale.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs.
It fits long-term database needs if you're applying it to the right use case.
DISCLAIMER: I am a bit of a Cassandra zealot. I've used it for a while, made minor contributions to the project, been named a "Cassandra MVP," and even co-authored a book about it. I think it's a great piece of tech, and you can do amazing things with it.
That being said, there are a lot of things that it's just not good at:
Query flexibility. The tradeoff you make for spreading rows across multiple nodes to meet operational scale is that you have to know your query patterns ahead of time and then follow them strictly. The idea is that you want every query served by a single node, and you'll have to put some thought into your data model to achieve that. Unbound queries (SELECTs without WHERE clauses) become the enemy. (There's a small modeling sketch after this list.)
Updating data in-place. Plan on storing values by a key, but then updating them a lot (ex: status)? Cassandra is not a good fit for that. This is because Cassandra has a log-structured storage engine which doesn't overwrite anything...it just obsoletes it. So your previous values are still there, still taking up space and compute resources.
Deleting Data. Deleting data in the distributed database world is tricky. After all, how do you replicate nothing to another node? Cassandra's answer to that problem, is to use a structure called a tombstone. Tombstones take up space, can slow performance, and need to stay around long enough to replicate (making their removal tricky).
Maintaining Data Consistency. Being highly-available and partition tolerant, Cassandra embraces the concept of "eventual consistency." So it should come as no surprise that it really wasn't designed to be consistent. It has a lot of mechanisms which will help keep data consistent, but they are far from perfect. Plus, there really isn't a way to know for sure if your data is in sync or not.
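To make the "know your queries first" point from the list above concrete, here is a minimal sketch using the DataStax Python driver. The keyspace, table, and column names (shop, orders_by_customer, orders_by_day) are illustrative assumptions on my part, not anything from the question:

```python
# Minimal sketch of query-first modeling: one table per query pattern,
# partitioned so each query is served from a single partition.
# Keyspace/table/column names are made up for illustration.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_customer (
        customer_id uuid,
        order_time  timestamp,
        order_id    uuid,
        total       decimal,
        PRIMARY KEY ((customer_id), order_time, order_id)
    ) WITH CLUSTERING ORDER BY (order_time DESC, order_id DESC)
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_day (
        order_date  date,
        order_time  timestamp,
        order_id    uuid,
        customer_id uuid,
        total       decimal,
        PRIMARY KEY ((order_date), order_time, order_id)
    )
""")

# Each query names its full partition key, so it hits one partition.
some_customer_id = uuid.uuid4()
rows = session.execute(
    "SELECT order_id, total FROM orders_by_customer WHERE customer_id = %s",
    [some_customer_id])
```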
If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
Materialized views are something that I'd continue to stay away from for the foreseeable future. They're "experimental" for a reason. Basically, once they're out of sync, the only way to get them back in sync is to rebuild them.
I coach my dev teams on keeping their query tables (tables containing the same data, just keyed differently) in sync with BATCH statements. In fact, BATCH is a bit of a misnomer, as it probably should have been named "ATOMIC" instead. Because of its name, it is heavily misused, and its misuse can lead to problems. But it does apply mutations atomically, so that does help.
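As a hedged illustration of that, a logged batch with the Python driver looks roughly like this (reusing the hypothetical orders_by_customer / orders_by_day tables from the earlier sketch):

```python
# Rough sketch: apply the same mutation to two query tables as one
# logged BATCH. Logged batches are about atomicity, not speed.
# Keyspace/table names are the illustrative ones from the earlier sketch.
import uuid
from datetime import datetime, date
from decimal import Decimal
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(["127.0.0.1"]).connect("shop")

insert_by_customer = session.prepare(
    "INSERT INTO orders_by_customer (customer_id, order_time, order_id, total) "
    "VALUES (?, ?, ?, ?)")
insert_by_day = session.prepare(
    "INSERT INTO orders_by_day (order_date, order_time, order_id, customer_id, total) "
    "VALUES (?, ?, ?, ?, ?)")

customer_id, order_id = uuid.uuid4(), uuid.uuid4()
now = datetime.utcnow()
total = Decimal("42.00")

batch = BatchStatement()                        # LOGGED batch by default
batch.add(insert_by_customer, (customer_id, now, order_id, total))
batch.add(insert_by_day, (date.today(), now, order_id, customer_id, total))
session.execute(batch)                          # both inserts land, or neither
```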
Basically, scrutinize your database requirements. If Cassandra doesn't cut it, then try to find one which does. CockroachDB (or one of the other NewSQLs) might be a better fit for what you're talking about. It tries to be a drop-in for Postgres, and it scales with some Cassandra-like mechanisms, so it might be worth looking into.
Cassandra is very good at what it does but it is not a drop-in replacement for an RDBMS. If you find that you need any of the following, I would not encourage you to migrate to Cassandra:
Strict consistency
ACID transactions
Support for ad-hoc queries, including joins, aggregates, etc.
Now as for you hitting some limits (or thinking you will hit them in the future) with MySQL, here are some thoughts:
Don't think that a limitation in MySQL is a limitation in RDBMS in general. Just so you don't think I am a $some_other_DB zealot, I've been using MySQL for almost 20 years, but it is not the best tool for all jobs.
If by 'changes' you mean 'schema changes', a lot of the pain can be alleviated by either:
Using an RDBMS where they are implemented better (including perhaps a more recent MySQL version)
Using community supported tools such as pt-online-schema-change or gh-ost
Good luck!
We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern, and since Cassandra is a schema-based db, I don't suppose that's possible anyway when the events come in many different forms? Would Cassandra be a good fit for this? If so, is there something one should be aware of?
I had the exact same requirements for a "project" (rather, a tool) a year ago; I used Cassandra and didn't regret it. In general it fits very well. You can fit quite a lot of data into a Cassandra cluster, the performance is impressive (although you might need some tweaking), and the natural ordering is a nice thing to have.
Rather than going over the benefits of using it, I'll concentrate on possible pitfalls you might not have considered before starting.
You have to think about your schema. Data is naturally ordered within one row (partition) by the clustering key; in your case that will be the timestamp. However, you cannot order data across different rows. They might come back ordered after the query, but that is not guaranteed in any way, so don't rely on it. There was some way to write such a query before 2.1, I believe (using ORDER BY, disabling paging, and allowing filtering), but it performed badly and I don't think it is even possible now. So you should order data across rows on the querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time and you put them in different rows. You have to fetch those rows with the different variable types, then do the re-sorting on the querying side. Another way to do it is to put all variable types in one row, but then filtering for only a subset becomes an issue to solve.
Row length is limited to 2 billion elements, and although that seems like a lot, it really is not unreachable with time series data, especially because you don't want to get anywhere near those two billion; keep it to hundreds of millions at most. If you split rows on some parameter (an increasing index, or rounding by day/month/year), you will have to implement that in your query logic as well (there's a small schema sketch after this list of pitfalls).
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries; there are specific rules in CQL about filtering and the WHERE clause.
Don't expect too much from the collections within the columns; their length is limited to ~65,000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) ).
All in all, these things might seem important, but they are really not too much of a hassle once you get to know Cassandra a bit. I'm underlining them just to give you a heads-up. If something doesn't seem logical at first, fall back to understanding why it is that way, and the whole theory of data distribution and the ring topology.
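To make the ordering and bucketing pitfalls concrete, here is a minimal sketch of a day-bucketed event table and a replay loop with the Python driver. The keyspace, table, and column names, and the day-sized bucket, are all assumptions of mine:

```python
# Sketch: events ordered by timestamp inside a day-sized partition, so
# each partition stays bounded and each read comes back in replay order.
# Keyspace/table/column names are made up for illustration.
from datetime import date
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("events_ks")

session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_day (
        day        date,          -- partition key: bounds partition size
        event_time timestamp,     -- clustering key: natural replay order
        event_id   timeuuid,      -- tie-breaker for identical timestamps
        payload    blob,
        PRIMARY KEY ((day), event_time, event_id)
    ) WITH CLUSTERING ORDER BY (event_time ASC, event_id ASC)
""")

# Replay: walk the day buckets in order on the client side; each
# single-partition query returns rows already sorted by event_time.
for day in (date(2024, 1, 1), date(2024, 1, 2)):
    for row in session.execute(
            "SELECT event_time, payload FROM events_by_day WHERE day = %s", [day]):
        print(row.event_time, len(row.payload))   # stand-in for real replay logic
```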
Based on the requirements you expressed, Cassandra could be a good fit, as it's a write-optimized data store. Time series are quite a common pattern, and you can define a clustering order, for example, on the timestamp of the events in order to retrieve all the events in time order. I found this article on DataStax Academy very useful when I wanted to learn about time series.
A variable data structure is not a problem: you can store the data in a BLOB and then parse it in your application (i.e. store it as JSON and read it back into your model), or you could even store the data in a map, although collections in Cassandra have some caveats that are good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
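As a small sketch of the BLOB-plus-JSON option (reusing the illustrative events_by_day table from the earlier sketch; all names are assumptions):

```python
# Sketch: keep the variable part of each event as JSON, serialized into
# the blob column, and parse it back in the application on read.
import json
import uuid
from datetime import date, datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("events_ks")

payload = json.dumps({"type": "temperature", "value": 21.4, "unit": "C"})
session.execute(
    "INSERT INTO events_by_day (day, event_time, event_id, payload) "
    "VALUES (%s, %s, %s, %s)",
    [date.today(), datetime.utcnow(), uuid.uuid1(), payload.encode("utf-8")])

row = session.execute(
    "SELECT payload FROM events_by_day WHERE day = %s LIMIT 1",
    [date.today()]).one()
event = json.loads(row.payload.decode("utf-8"))   # back to a dict/model
```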
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.
There is a requirement to keep a list of the top 10 localities in a city from which demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to build a near-real-time (lag of no more than 5 minutes) in-memory datastore that would
- keep a count of incoming demand by locality (geohash)
- serve reads from hundreds of our suppliers every minute (the AJAX refresh is every minute)
I was thinking of a multi-threaded, synchronized max-heap. This would be a complex solution, as tree locking is itself complex to implement.
Any recommendations for the best in-memory (replicable, master-slave) data structure that can be read and updated in a multi-threaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a need, so no MySQL-based solutions. If you recommend a Redis or MongoDB solution, please realize that the queries are not point queries by key but top-N queries.
Thanks in advance.
If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
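As a rough, hedged illustration of the count-min-sketch-plus-top-N idea: the sketch dimensions, the candidate-set size, and all names below are arbitrary choices of mine, and this single-threaded version would still need sharding or locking for your 100K updates/sec:

```python
# Sketch: count-min sketch for per-geohash hit counts, plus a small
# bounded candidate set from which the top 10 is read. Not thread-safe;
# in practice you would shard by key or guard updates with a lock.
import heapq
import random


class CountMinSketch:
    def __init__(self, width=2048, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row, seed in enumerate(self.seeds):
            yield row, hash((seed, key)) % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))


cms = CountMinSketch()
candidates = {}                     # geohash -> estimated count (bounded)

def record_hit(geohash):
    cms.add(geohash)
    candidates[geohash] = cms.estimate(geohash)
    if len(candidates) > 50:        # prune the candidate set occasionally
        for key, _ in heapq.nsmallest(len(candidates) - 50,
                                      candidates.items(), key=lambda kv: kv[1]):
            del candidates[key]

def top10():
    return heapq.nlargest(10, candidates.items(), key=lambda kv: kv[1])
```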
Hope this helps!
According to the CAP theorem, Cassandra can only offer eventual consistency. To make things worse, if we have multiple reads and writes during one request without proper handling, we may even lose logical consistency. In other words, if we do things fast, we may do them wrong.
Meanwhile, the best practice for designing a Cassandra data model is to think about the queries we are going to run and then add a CF (column family) for each. This way, adding or updating one entity often means updating many views/CFs. Without an atomic transaction feature, it's hard to do this right. But with one, we lose the A and P parts again.
I don't see this concerning many people, and I wonder why.
Is this because we can always find a way to design our data model so that we avoid doing multiple reads and writes in one session?
Is this because we can just ignore the 'right' part?
In real practice, do we always have ACID features somewhere in the middle? I mean, maybe implemented in the application layer, or with a middleware added to handle it?
It does concern people, but presumably you are using Cassandra because a single database server is unable to meet your needs due to scaling or reliability concerns. Because of this, you are forced to work around the limitations of a distributed system.
In real practice, do we always have ACID features somewhere in the middle? I mean, maybe implemented in the application layer, or with a middleware added to handle it?
No, you don't usually have ACID somewhere else, as presumably that somewhere else would have to be distributed over multiple machines as well. Instead, you design your application around the limitations of a distributed system.
If you are updating multiple columns to satisfy queries, you can look at the eventually atomic section in this presentation for ideas on how to do that. Basically, you write enough information about your update to Cassandra before you do your write. That way, if the write fails, you can retry it later.
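A minimal sketch of that "record the intent first" idea, assuming a hypothetical pending_updates table and two made-up query tables (none of these names come from the presentation):

```python
# Sketch of "eventually atomic": durably record what you intend to do,
# apply the real (idempotent) mutations, then clear the intent. A
# background sweeper can replay any intents that are still pending.
# All keyspace/table/column names here are illustrative.
import json
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")

def update_user_email(user_id, new_email):
    intent_id = uuid.uuid1()
    # 1. Record the intended update before touching the query tables.
    session.execute(
        "INSERT INTO pending_updates (intent_id, payload) VALUES (%s, %s)",
        [intent_id, json.dumps({"user_id": str(user_id), "email": new_email})])
    # 2. Apply the real mutations to each query table.
    session.execute(
        "UPDATE users_by_id SET email = %s WHERE user_id = %s",
        [new_email, user_id])
    session.execute(
        "INSERT INTO users_by_email (email, user_id) VALUES (%s, %s)",
        [new_email, user_id])
    # 3. Clear the intent; if we crash before this, the sweeper replays
    #    the payload from pending_updates later.
    session.execute(
        "DELETE FROM pending_updates WHERE intent_id = %s", [intent_id])
```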
If you can structure your application in such a way, using a coordination service like ZooKeeper or Cages may be useful.
I'm looking for a persistence solution (maybe a NoSQL db? or something else...) that has the following criteria:
1) Has a Haskell API
2) Is disk space efficient--the db could easily get to many gigabytes of data but I need it to run well on a typical desktop. I need something that stores the data as efficiently as possible. So, for example, storing field names in a record would be bad.
3) High performance for reading sequential records. The typical use case is start somewhere and then read forward straight through the data--reading through possibly millions of records as quickly as possible.
4) Data is basically never changed (would only be changed if it was discovered data was incorrect somehow), just logged
5) It should act directly on file(s) that can be easily moved/copied around. It should not be calling a separate running server.
If you drop the "single file, no separate server process" requirement, everything else can be fulfilled by any standard RDBMS and, depending on the type of data, sometimes especially well by columnar stores.
The only single-file solution I know of is SQLite. SQLite mainly founders when a single db needs to be accessed by multiple concurrent processes. If that isn't the case, then I wouldn't be surprised if you could scale it up significantly.
Additionally, if you're only looking for sequential scans and key-value stores, you could just go with berkeleydb, which is known to be high-performance for very large data sets.
There are high quality Haskell bindings for talking to both sqlite and berkeleydb.
Edit: For sequential access only, it's also blindingly straightforward to roll your own layer with the binary or cereal packages -- you basically need to write a helper function to wrap reading records from a file sequentially rather than all at once. An abstraction for folding over them is nice as well. Then you can decide to append to a single file, or spread your writes across files as you go. Either way, that's the most lightweight and straightforward option of all. The only drawback is having to worry about durability -- safe writes in the presence of interrupts, and all that other stuff that a good DB solution should take care of for you.
CouchDB ticks most of your boxes:
1) http://hackage.haskell.org/package/CouchDB
2) Depends on how you use it. You can store any binary data in it, but it's up to you to know what it means. Or you can store XML or JSON, which is less space-efficient but easier to migrate as your schema evolves (which it will).
3) Don't know, but it's used for big web sites.
4) CouchDB uses a CM-like concept of updates and baselines, so old data stays around. It can be purged later as obsolete, but I think that's optional.
5) No. It's written in Erlang and runs (I believe) as a separate process. But why is that a problem?