Tombstone scanning in system.log - cassandra

I have a cassandra cluster with less delete use case. I found in my system.log "Read 10 live and 5645464 tombstones cells in keyspace.table" What does it mean? please help to understand.
Thanks.

For Cassandra, all the information recorded is immutable. This means that when you have a delete operation (explicit with a delete statement or with a Time To Live [TTL] clause), the database will add another record with a special flag named tombstone. All these records will stay on the database until the gc_grace_seconds periods have passed; the default is 10 days.
In your case, the engine found out that most of the records retrieved were deleted, but they are still waiting for the gc_grace_seconds to pass, to let compaction reclaim the space. One possible option to fix the issue is to decrease gc_grace_seconds for that table.
For more information, please refer to this article from the Last Pickle.

One more important thing to keep in mind when working with Cassandra is that tombstones cells do not directly correlate to deletes.
When you insert null value to an attribute when performing your insert, Cassandra internally marks that attribute/cell as a tombstone. So, even if you don't have a lot of deletes happening, you could end up with an enormous number of tombstones. Easy and simple solution is to not insert null values for an attribute while inserting.
As per this statement Read 10 live and 5645464 tombstones cells in keyspace.table goes, there might be a table scan for a query happening that is scanning 10 cells and 5645464 number of tombstones (cells with null value) while doing so is what I am guessing. Need to understand what type of queries are being executed to gain more insight into that.

Related

Does Cassandra store only the affected columns when updating a record or does it store all columns every time it is updated?

If the answer is yes,
Does that mean unlike Mongo or RDMS, whether we retrieve every column or some column will have big performance impact in Cassandra?(I am not talking about transfer time over network as it will affect all of the above)
Does that mean during compaction, it cannot just stop when it finds the latest row for a primary key, it has to go through the full set in SSTables? (I understand there will be optimisations as previously compacted SSTable will have maximum one occurrence for row)
Please ask only one question per question.
That is entirely up to you. If you write one column value, it'll persist just that one. If you write them all, they will all persist, even if they are the same as the current value.
whether we retrieve every column or some column will have big performance impact
This is definitely the case. Queries for column values that are small or haven't been written to or deleted will be much faster than the opposite.
during compaction, it cannot just stop when it finds the latest row for a primary key, it has to go through the full set in SSTables?
Yes. And not just during compaction, but read queries will also check multiple SSTable files.

Deleting column in cassandra for large dataset

We have a redundant column that we'd like to delete from our Cassandra database (version 2.1.15). This is a text column represents the majority of data on disk (15 nodes X 1.8 TB per node).
The easiest option just seems to be an alter table to remove that column, and then let Cassandra compaction take care of things (also running Cassandra Reaper to manage repairs). However, given the size of the dataset I'm concerned I will knock over the cluster with a massive delete.
Other options I've consider is a process that will run through the keyspace setting the value to null, but I think this will have the same effect as removing the column, but is more under out control (but also requires writing something to do this).
Would anyone have any advice on how to approach this?
Thanks!
Dropping a column does mark the deleted values as tombstones. The column value becomes unavailable immediately and the column data is removed in the next compaction cycle.
If you want to to expedite the removal of the column before the compaction occurs, you can run nodetool upgradesstables to remove the data, after you use the ALTER TABLE command to change the metadata for the column.
See Documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/alter_table_r.html
If I remember correctly, drop of column doesn't really mark the deleted values with tombstone, but instead inserts corresponding entry into system.dropped_columns table, and then code, like, SerializationHelper & BTreeRow, performs filtering on the fly. The data will be deleted when compaction will happen.
Explicitly setting the value to null won't make situation better because you'll add data to the table.
I would recommend to test deletion on small cluster & check how it behaves.

Will Cassandra return tombstone rows as valid rows?

My idea is to add rows to Cassandra with TTL = 15 minutes so I'll be able to load realtime data (now - 15 minutes, now) w/o storing timestamps etc. My concern is that the rows with expiring TTL will be marked as tombstone (not actually deleted). I.e., will they count when I run select count(*) from realtime_table?
No, tombstoned rows won't be returned as a result - they will be skipped when reading the data.
But if you actively expiring the data, you may need to tune gc_grace_period, otherwise you can get too many not removed tombstones, and in some cases will start to get warning or error during read if read operation will need to skip tombstones (controlled by tombstone_warn_threshold & by tombstone_failure_threshold options of cassandra.yaml.
Here is the very good blog post that describes how data are deleted & cleaned up.
But select count(*) from table is real antipattern in Cassandra - you need to consider correct modelling of your data with partitions, etc.

Cassandra Query Timeout with small set of data

I am having a problem with Cassandra 2.1.17. I have a table with about 40k "rows" in it. One partition I am having a problem with has maybe about 5k entries in it.
Table is:
create table billing (
accountid uuid,
date timeuuid,
credit double,
debit double,
type text,
primary key (accountid,date)
) with clustering order by (date desc)
So there is a lot of inserting and deleting from this table.
My problem is that somehow it seems to get corrupt I think because I am no longer able to select data past a certain point from a partition.
From cqlsh I can run soemthing like this.
SELECT accoutid,date,credit,debit,type FROM billing WHERE accountid=XXXXX-xxxx-xxxx-xxxxx... AND date < 3d466d80-189c-11e7-8a57-f33cbced2fc5 limit 2;
First I did a select limit of 10000 it works up to around 5000 rows pageing through them then towards the end it will give a timeout error.
I then use the second from last timeuuid and select limit 2 it will fail limit 1 will work.
If I use the last timeuuid as a < and limit to 1 it will also fail.
So just looking for what I can do here I am not sure what is wrong and not sure how I can fix/diagnose what happened.
I have tired a repair and force a compaction. but it still seems to have the issue.
Thank you for any help.
Try to start with running manual compaction on your table.
You can increase read_request_timeout_in_ms parameter in cassandra config.
Consider moving to leveled compaction strategy if you are having a lot of deletes and updates.
I think you got too many tombstones in this partition.
What is a tombstone ?
To remember that a record has been deleted Cassandra creates a special value called a "tombstone". A tombstone has a TTL as any other value has but it is not compacted as easily as any other value is. Cassandra keeps it longer to avoid such inconsistency as data reappearence.
How to watch tombstones ?
nodetool cfstats gives you an idea of how many tombstones you have on average per slice
How to fix the issue ?
The duration a tombstone is preserved is gc_grace_seconds. You have to reduce it and then run a major compaction to fix the issue.
It looks to me like you are hitting a lot of tombstones when you do selects. The thing is while they are there cassandra still has to go over them. There might be multiple factors like ttl with insert statements, a lot of deletes, inserting of nulls etc.
My bet would be that you would need to adjust gc_grace_seconds on table and run repairs more often. But be careful and don't set it to to low (one round of repair has to finish before this time).
It's all nicely explained here:
https://opencredo.com/cassandra-tombstones-common-issues/

Why don't an upsert create Tombstones in Cassandra?

As per Question regarding Tombstone, why doesn't upserts create tombstones?
As per datastax documentation, How is data updated ? for every upsert, cassandra considers as delete followed by insert, as the new timestamps of the insert overwrites the old timestamp. The old timestamp data has to be marked as delete which relates to tombstone.
Why do we have contradicting statements? or else am I missing anything here?
Usecase:
Data is inserted with unique key (uuid) in Cassandra and some of the columns in this data keeps updating frequently. Which approach do you recommend?
Inserting the same data with new column values in the
Insert query.
Updating the existing record based on given uuid
with new column values in the update query.
Which approach does or doesn't create tombstones? and how does Cassandra handle both queries?
As Russ pointed out, you may want to read other similar questions on this topic. However,
An upsert/overwrite is just-another-cell, with a name, a timestamp and a value.
A tombstone is just like an overwrite, except it gets one extra field indicating that it's been deleted, so that it isn't returned as valid output. The reason tombstones are often harmful is that they can accumulate in bad data models, even when people think the data is gone - and skipping them to get to live data actually requires memory.
When you update/upsert as you describe, the cell you create SHADOWS (obsoletes) the previous cell, which will be removed upon compaction. That previous cell is NOT a tombstone, even though it's no longer live/active - it will be compacted away and completely replaced by the new, live, highest-timestamp value as soon as compaction allows.
The biggest thing to keep in mind is this: tombstones aren't necessarily removed by compaction - they're kept around (persisted/rewritten) for at least gc_grace_seconds, and potentially even long if they need to shadow/cover other cells in sstables not-yet-compacted. Because of this, tombstones stay around for a long time, but shadowed/overwritten cells are gc'd as soon as the sstable they're in is compacted.

Resources