In the Cassandra documentation here it says:
While STCS works well to compact a write-intensive workload, it makes reads slower because the merge-by-size process does not group data by rows. This makes it more likely that versions of a particular row may be spread over many SSTables.
1) What does 'group data by rows' mean? Aren't all rows for a partition already grouped?
2) How is it possible for a row to have multiple versions on a single node? Doesn't the upsert behavior ensure that only the latest version of a row is accessible via the memtable and partition indices? Isn't it true that when a row is updated and the memtable flushed, the partition indices are updated to point to the latest version? Then, on compaction, this latest version (because of the row timestamp) is the one that ends up in the compacted SSTable?
Note that I'm talking about a single node here - NOT the issue of replicas being out of sync.
Either this is incorrect or I am misunderstanding what that paragraph says.
Thanks!
OK, I think I found the answer myself - I would be grateful for any confirmation that this is correct.
A row may have many versions because updates/upserts can write only part of a row. Thus, the latest version of a complete row is made up of all the latest updates for all the columns in that row - which can be spread out across multiple SSTables.
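To make that concrete, here is a small CQL sketch of what I mean (table and column names are invented):

    -- Hypothetical table, just for illustration.
    CREATE TABLE users (
        id    int PRIMARY KEY,
        name  text,
        email text
    );

    -- First upsert writes only the name cell; suppose the memtable is
    -- flushed afterwards, so this cell ends up in SSTable A.
    UPDATE users SET name = 'alice' WHERE id = 1;

    -- Second upsert writes only the email cell; after the next flush it
    -- ends up in SSTable B.
    UPDATE users SET email = 'alice@example.com' WHERE id = 1;

    -- Reading the full row now has to merge the newest cell of each
    -- column from both SSTables (and the memtable).
    SELECT * FROM users WHERE id = 1;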
My misunderstanding seemed to stem from the idea that the partition indices can only point to one location in one SSTable. If I relax this constraint, the statement in the doc makes sense. I must therefore assume that an index in the partition indices for a primary key can hold multiple locations for that key. Can someone confirm that all this is true?
Thanks.
Related
We use a very simple key-value data model in Cassandra, and our partition key is in 17 SSTables. I would like to understand how reads work in our concrete case.
If I understand correctly, Cassandra reads generally need to search for the newest version of each column in the memtable and in different SSTables, until all columns have been retrieved and merged.
Since SSTables are sorted by time, and our data model is single-column, ideally our read operations should only hit the newest SSTable containing our partition key, since that will contain the whole row.
Will our read operations hit the 17 SSTables? or just the newest one containing the searched partition key?
Cassandra will search all of them, as it isn't sure which columns exist where (writes happen at the cell level, so different versions of a cell can exist in different SSTables and have to be reconciled at read time). Reads are done at the partition level. However, Cassandra can filter out SSTables if it knows the partition key doesn't exist in certain ones. That's why compaction is important for optimal reads - it removes the unnecessary cells.
Will our read operations hit the 17 SSTables? or just the newest one containing the searched partition key?
To add to Jim's answer, Cassandra has something called a bloom filter for this. Essentially, it's a probabilistic structure that can tell you one of two things:
The SSTable might contain the data requested.
OR
The SSTable definitely does not contain the data requested.
This should prevent Cassandra from having to scan all 17 SSTables. My advice would be to run a query with TRACING ON in cqlsh, and it'll tell you just how many SSTables it needed to look through.
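For example, in cqlsh (keyspace, table, and key are placeholders):

    TRACING ON;
    SELECT value FROM my_keyspace.my_table WHERE id = 'some-key';

The trace output shows how many SSTables were touched for the read and which ones could be skipped (the exact wording of the trace events varies between versions).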
In DataStax's documentation, it says:
During a write, Cassandra adds each new row to the database without checking on whether a duplicate record exists. This policy makes it possible that many versions of the same row may exist in the database.
As far as I understand, that means there may be more than one non-compacted SSTable containing different versions of the same row. How does Cassandra handle the duplicated data when it reads from these SSTables?
#quangh: As already stated in the documentation:
This is why Cassandra performs another round of comparisons during a read process. When a client requests data with a particular primary key, Cassandra retrieves many versions of the row from one or more replicas. The version with the most recent timestamp is the only one returned to the client ("last-write-wins").
Every write operation has a timestamp associated with it. In this case different nodes (or different SSTables on one node) may hold different versions of the same row, but during a read operation Cassandra will pick the version with the latest timestamp. I hope this answers your question.
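A minimal CQL illustration of last-write-wins (the table is made up, and the explicit timestamps are only there to make the ordering obvious):

    CREATE TABLE users (
        id   int PRIMARY KEY,
        name text
    );

    -- Two upserts to the same primary key, possibly flushed into
    -- different SSTables.
    INSERT INTO users (id, name) VALUES (1, 'old-name') USING TIMESTAMP 1000;
    INSERT INTO users (id, name) VALUES (1, 'new-name') USING TIMESTAMP 2000;

    -- The read merge keeps the cell with the highest write timestamp.
    SELECT name, WRITETIME(name) FROM users WHERE id = 1;
    -- Returns 'new-name' with writetime 2000; the older cell is dropped
    -- at read time and eventually removed by compaction.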
Given a simple CQL table which stores an ID and a Blob, is there any problem or performance impact of storing potentially billions of rows?
I know with earlier versions of Cassandra wide rows were de rigueur, but CQL seems to encourage us to move away from that. I don't have any particular requirement to ensure the data is clustered together or able to filter in any order. I'm wondering whether very many rows in a CQL table could be problematic in any way.
I'm considering binning my data, that is, creating a partition key which is hash % n of the ID, limiting the data to n 'bins' (millions of?). Before I add that overhead I'd like to validate whether it's actually worthwhile.
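Roughly what I have in mind (table names and the bucket count are placeholders):

    -- Plain layout: one row per ID.
    CREATE TABLE blobs (
        id   uuid PRIMARY KEY,
        data blob
    );

    -- Binned layout: the client computes bucket = hash(id) % 1000 and
    -- uses it as the partition key, so each partition holds many IDs.
    CREATE TABLE blobs_binned (
        bucket int,
        id     uuid,
        data   blob,
        PRIMARY KEY (bucket, id)
    );

    SELECT data FROM blobs_binned
    WHERE bucket = 457 AND id = 123e4567-e89b-12d3-a456-426655440000;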
First, I don't think this is correct:
I know with earlier versions of Cassandra wide rows were de rigueur, but CQL seems to encourage us to move away from that.
Wide rows are supported and work well. There's a post from Jonathan Ellis, Does CQL support dynamic columns / wide rows?:
A common misunderstanding is that CQL does not support dynamic columns or wide rows. On the contrary, CQL was designed to support everything you can do with the Thrift model, but make it easier and more accessible.
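For example, a "wide row" in CQL terms is simply a partition with a clustering column (a sketch with invented names):

    -- One partition per sensor; each reading is a clustered row within
    -- that partition. This is the CQL equivalent of a Thrift-style
    -- wide row.
    CREATE TABLE readings (
        sensor_id    uuid,
        reading_time timestamp,
        value        double,
        PRIMARY KEY (sensor_id, reading_time)
    );

    -- All readings for one sensor live in a single (wide) partition:
    SELECT reading_time, value FROM readings
    WHERE sensor_id = 123e4567-e89b-12d3-a456-426655440000;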
For the part about the "performance impact of storing potentially billions of rows", I think the important thing to keep in mind is the size of those rows.
According to Aaron Morton in this mail thread:
When rows get above a few 10's of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100 MB it's a warning sign. And when they get above 1 GB, well, you don't want to know what happens then.
and later:
Larger rows take longer to go through compaction, tend to cause more JVM GC, and have issues during repair. See the in_memory_compaction_limit_in_mb comments in the yaml file. During repair we detect differences in ranges of rows and stream them between the nodes. If you have wide rows and a single column is out of sync, we will create a new copy of that row on the node, which must then be compacted. I've seen the load on nodes with very wide rows go down by 150GB just by reducing the compaction settings.
IMHO, all things being equal, rows in the few 10's of MB work better.
In a chat with Aaron Morton (last pickle) he indicated that billions of rows per table is not necessarily problematic.
Leaving this answer for reference, but not selecting it as "talked to a guy who knows a lot more than me" isn't particularly scientific.
I have a table set up in Cassandra like this:
Primary key columns:
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
Example of how this table is used:
    shard | last_used        | value
    ------+------------------+------------------------
      457 | 5/16/2012 4:56pm | NBJO3poisdjdsa4djmka8k   >-- Remove from front...
      600 | 6/17/2013 5:58pm | dndiapas09eidjs9dkakah    |
      ... | (1 million more rows)                        |
      457 | NOW              | NBJO3poisdjdsa4djmka8k   <-- ...and put in back
The table is used as a giant queue. Very many threads are trying to "pop" the row with the lowest last_used value and then update last_used to the current moment in time. Because last_used is part of the primary key, once a row is read it is deleted, and a new row with the same shard and value but an updated last_used time is added to the table, at the "end of the queue" (sketched in CQL below).
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back that they would severely bottleneck each other if only one could access the queue at a time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (coordinated via Redis).
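In CQL terms, each "pop" amounts to something like the following (a simplified sketch, not our exact schema or code):

    CREATE TABLE queue (
        shard     int,
        last_used timestamp,
        value     text,
        PRIMARY KEY (shard, last_used)
    );

    -- "Pop" the oldest row in one shard (clustering order is ascending,
    -- so LIMIT 1 returns the lowest last_used)...
    SELECT last_used, value FROM queue WHERE shard = 457 LIMIT 1;

    -- ...delete it (this writes a tombstone)...
    DELETE FROM queue WHERE shard = 457 AND last_used = '2012-05-16 16:56:00';

    -- ...and re-insert it at the "back of the queue" with the current
    -- time, supplied by the application.
    INSERT INTO queue (shard, last_used, value)
    VALUES (457, '2013-06-17 17:58:00', 'NBJO3poisdjdsa4djmka8k');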
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow - on the order of about 30 seconds, a virtual eternity.
We have only been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing and reading so much to and from the same table. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it, or in the way we have it configured, that we need to change or adjust? How might we troubleshoot this?
More Info
We are using the Murmur3Partitioner (the new default partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!
This is something you should not use Cassandra for. The reason you're having performance issues is that Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something, Cassandra writes a tombstone: a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting, Cassandra looks at the tombstones and determines which columns are dead and which are still live; the dead ones are thrown away (but then there is also GC grace, which means that in order to avoid spurious resurrections of columns, Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns there will be enormous amounts of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue such as RabbitMQ, or possibly Kafka, would be a much better fit. They are built for constant churn and FIFO semantics; Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I still would say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.
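For example, assuming a schema along the lines of the one sketched in the question, the query could look like this (the bound value is something the application has to keep track of itself):

    -- Lower-bound the clustering column so the read skips the
    -- tombstone-heavy head of the partition; the literal here is only
    -- an illustrative value.
    SELECT last_used, value
    FROM queue
    WHERE shard = 457 AND last_used > '2013-06-17 00:00:00'
    LIMIT 1;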
The system appears to be under stress (2GB of RAM may not be enough).
Please run nodetool tpstats and report back its results.
Use RabbitMQ. Cassandra is probably a bad choice for this application.
I am doing some simple operations in Cassandra; to keep things simple I am using a single node. I have a single row and I add 10,000 columns to it, then I delete those 10,000 columns; after a while I add 10,000 more columns and later delete them again, and so on. Each delete removes all the columns in that one row.
Here's the thing I don't understand: even though I delete the columns, I see the size of the database increase. My GC grace period is set to 0 and I am using the Leveled Compaction Strategy.
If I understand tombstones correctly, they should be removed after the first major compaction, but it appears they are not, even after running the nodetool compact command.
I read on a mailing list that these are "rolling" tombstones (created when you frequently update and delete the same row) and are not handled by major compaction. So my question is: when are they deleted? If they never are, the data will just keep growing, which I personally think is bad. To make matters worse, I could not find any documentation about this particular effect.
First, as you're discovering, this isn't a really good idea. At the very least you should use row-level deletes, not individual column deletes.
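For example, in CQL terms (table and column names are invented, assuming a wide-row layout like the one described in the question):

    CREATE TABLE wide (
        key   text,
        col   int,
        value text,
        PRIMARY KEY (key, col)
    );

    -- Deleting the columns one by one leaves one tombstone per column:
    DELETE FROM wide WHERE key = 'row1' AND col = 1;
    DELETE FROM wide WHERE key = 'row1' AND col = 2;
    -- ... 10,000 of these ...

    -- A single row-level delete leaves one tombstone that shadows
    -- everything written to the row before it:
    DELETE FROM wide WHERE key = 'row1';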
Second, there is no such thing as a major compaction with LCS; nodetool compact is a no-op.
Finally, Cassandra 1.2 improves compaction a lot for workloads that generate a lot of tombstones: https://issues.apache.org/jira/browse/CASSANDRA-3442
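For example, the tombstone-related compaction subproperties added in that release can be tuned per table; the values below are only illustrative:

    -- Let single-SSTable tombstone compactions kick in more eagerly.
    ALTER TABLE ks.wide WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'tombstone_threshold': '0.2',
        'tombstone_compaction_interval': '86400'
    };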