I have a few questions about Cassandra tombstones and manual compaction.
Let's say that I delete a row (partition key) in my Cassandra cluster at time X. Let's assume that gc_grace_seconds has its default value (ten days).
Is it true that if I manually start nodetool compact earlier than X+10 days, the old data will still be on disk after the compaction?
And if I start nodetool compact later than X+10 days, is the old data really removed from disk?
Let's assume that the delete was issued at time X and later on I change gc_grace_seconds to a lower value (let's say 1 day). If I start nodetool compact at time X+2 days, will the old data really be removed from disk? In other words, the tombstone, when created, contains the deletion time and not the expiration time, right?
Is it true that if I manually start nodetool compact earlier than X+10 days, the old data will still be on disk after the compaction?
Yes. Tombstones are not removed if compaction runs before gc_grace_seconds has elapsed.
And if I start nodetool compact later than X+10 days, is the old data really removed from disk?
Generally yes, but it also depends on the compaction strategy, so you cannot be 100% sure of this.
Let's assume that the delete was issued at time X and later on I change gc_grace_seconds to a lower value (let's say 1 day). If I start nodetool compact at time X+2 days, will the old data really be removed from disk? In other words, the tombstone, when created, contains the deletion time and not the expiration time, right?
Yes, you are correct. A tombstone contains the deletion time; when it expires depends on the table's current gc_grace_seconds value.
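To make the last point concrete, here is a minimal Python sketch of the age check, assuming the tombstone stores only its deletion time and that the table's current gc_grace_seconds is looked up when compaction runs. The function name tombstone_purgeable is made up for illustration; it is not Cassandra code, and the real check has further conditions (for example overlapping SSTables).

```python
from datetime import datetime, timedelta


def tombstone_purgeable(deleted_at: datetime, gc_grace_seconds: int, now: datetime) -> bool:
    """Hypothetical helper: True if a tombstone written at deleted_at is old
    enough to be dropped by a compaction running at time now."""
    # The tombstone carries only deleted_at; gc_grace_seconds is whatever the
    # table is configured with at compaction time.
    return deleted_at < now - timedelta(seconds=gc_grace_seconds)


x = datetime(2024, 1, 1)   # time X of the delete (arbitrary example date)
ten_days = 864000          # default gc_grace_seconds
one_day = 86400            # lowered value from the question

# Compaction at X+2 days with the default setting: tombstone is kept.
print(tombstone_purgeable(x, ten_days, x + timedelta(days=2)))
# Same compaction after lowering gc_grace_seconds to 1 day: tombstone can go.
print(tombstone_purgeable(x, one_day, x + timedelta(days=2)))
```

Because the expiration time is never stored, lowering gc_grace_seconds after the delete changes the outcome of the same compaction.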
You should generally not run the nodetool compact command (major compaction); your compactions should run automatically (minor compactions).
Related
I have a Cassandra table with a TTL of 60 seconds, and I have a few questions about it.
1) I am getting the following warning:
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM xx.yy WHERE token(y) >= token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see tombstone_warn_threshold)
What does this mean?
2) As far as I understand, with a TTL a tombstone is a flag (the data will be deleted after gc_grace_seconds).
i) So does this mean that it won't be deleted for 10 days?
ii) What is the consequence of waiting for 10 days?
iii) Why is it as long as 10 days?
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
gc_grace_seconds 864000 [10 days] The number of seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage-collection. Cassandra will not execute hints or batched mutations on a tombstoned record within its gc_grace_period. The default value allows a great deal of time for Cassandra to maximize consistency prior to deletion. For details about decreasing this value, see garbage collection below.
3) I read that performing compaction and repair using nodetool will delete the tombstones. How frequently do we need to run these in the background, and what are the consequences?
This means that your query returned 76 "live" or non-deleted/non-obsoleted rows of data, and that it had to sift through 1324 tombstones (deletion markers) to accomplish that.
In the world of distributed databases, deletes are hard. After all, if you delete a piece of data from one node, and you expect that deletion to happen on all of your nodes, how would you know if it worked? Quite literally, how do you replicate nothing? Tombstones (delete markers) are the answer to that question.
i. The data is gone (obsoleted, rather). The tombstone(s) will remain for gc_grace_seconds.
ii. The "consequence" is that you'll have to put up with those tombstone warning messages for that duration, or find a way to run your query without having to scan over the tombstones.
iii. The idea behind the 10 days is that if the tombstones are collected too early, your deleted data will "ghost" its way back up to some nodes. 10 days gives you enough time to run a weekly repair, which ensures your tombstones are properly replicated before removal.
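The "ghosting" can be illustrated with a toy Python model. This is only a sketch under simplifying assumptions: two replicas modeled as dicts, integer timestamps that double as deletion times, and hypothetical helpers (write, purge_tombstones, repair, read, scenario) that are not Cassandra APIs.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def write(replica, key, value, ts):
    """Last write wins: keep the cell with the highest timestamp."""
    current = replica.get(key)
    if current is None or ts > current[0]:
        replica[key] = (ts, value)


def purge_tombstones(replica, now, gc_grace):
    """Toy compaction: drop tombstones older than gc_grace."""
    expired = [k for k, (ts, v) in replica.items()
               if v is TOMBSTONE and ts < now - gc_grace]
    for key in expired:
        del replica[key]


def repair(a, b):
    """Toy repair: copy cells both ways so each replica has the newest one."""
    for key in set(a) | set(b):
        for src, dst in ((a, b), (b, a)):
            if key in src:
                write(dst, key, src[key][1], src[key][0])


def read(replica, key):
    cell = replica.get(key)
    return None if cell is None or cell[1] is TOMBSTONE else cell[1]


def scenario(repair_before_purge):
    a, b = {}, {}
    write(a, "k", "v1", ts=1)
    write(b, "k", "v1", ts=1)
    write(a, "k", TOMBSTONE, ts=5)  # the delete reaches only replica a
    if repair_before_purge:
        repair(a, b)                # tombstone is replicated in time
    purge_tombstones(a, now=20, gc_grace=10)
    purge_tombstones(b, now=20, gc_grace=10)
    repair(a, b)                    # a later repair
    return read(a, "k")


print(scenario(repair_before_purge=True))   # deletion sticks
print(scenario(repair_before_purge=False))  # deleted value comes back
```

In the second scenario the tombstone is collected before replica b ever saw it, so the later repair copies the old value back to replica a.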
Compaction removes tombstones. Repair replicates them. You should run repair once per week. While you can run compaction on-demand, don't. Cassandra has its own thresholds (based on number and size of SSTable files) to figure out when to run compaction, and it's best not to get in its way. If you do, you'll be manually running compaction from there on out, as you'll probably never reach the compaction conditions organically.
The consequences are that both repair and compaction take compute resources and can reduce a node's ability to serve requests. But they need to happen. You want them to happen. If compaction doesn't run, your SSTable files will grow in number and size, eventually causing rows to exist over multiple files, and queries for them will get slow. If repair doesn't run, your data is at risk of not being in sync.
I am using the awesome Cassandra DB (3.7.0) and I have questions about tombstones.
I have a table called raw_data. This table has a default TTL of 1 hour and receives new data every second. Another processor then reads one row and removes that row.
It seems like this raw_data table becomes slow at reading and writing after several days of running.
Is this because the deleted rows are staying around as tombstones? This table already has a TTL of 1 hour. Should I set gc_grace_seconds to something less than 10 days (the default value) to remove tombstones quickly? (By the way, I am running a single-node DB.)
Thank you in advance.
Deleting your data is the way to have tombstone problems. TTL is the other way.
It is pretty normal for reads on a Cassandra table to become slower and slower as deletes accumulate tombstones, and Cassandra will eventually refuse to read data from this table (a read is aborted once it scans more tombstones than tombstone_failure_threshold).
Setting gc_grace_seconds to less than the default 10 days is only one part of the equation. The other part is the compaction strategy you use. Indeed, a compaction is needed in order to remove tombstones.
I'd change my mind about my single-node cluster and go with the minimum standard of 3 nodes with RF=3. Then I'd design my project around something that doesn't explicitly delete data. If you absolutely need to delete data, make sure that C* runs compaction periodically and removes tombstones (or force C* to run compactions), and make sure to have plenty of IOPS, because compaction is very IO intensive.
In short, tombstones are used by Cassandra to mark data as deleted and to replicate that deletion to other nodes so the deleted data doesn't reappear. These tombstones are kept until gc_grace_seconds has passed. Creating many tombstones can slow down your table. Since you are using a single-node Cassandra, you don't have to replicate anything to other nodes, so you can lower gc_grace_seconds to 1 day without any ill effect. If you plan to add new nodes or data centers in the future, revisit this gc_grace_seconds value.
Recently I have been trying to familiarize myself with Cassandra, but I don't quite understand when data is removed from disk after it has been deleted. The use case I'm particularly interested in is expiring time series data with DTCS. As an example, consider the following table:
CREATE TABLE metrics (
    metric_id text,
    time timestamp,
    value double,
    PRIMARY KEY (metric_id, time),
) WITH CLUSTERING ORDER BY (time DESC) AND
    default_time_to_live = 86400 AND
    gc_grace_seconds = 3600 AND
    compaction = {
        'class': 'DateTieredCompactionStrategy',
        'timestamp_resolution':'MICROSECONDS',
        'base_time_seconds':'3600',
        'max_sstable_age_days':'365',
        'min_threshold':'4'
    };
I understand that Cassandra will create a tombstone for all rows inserted into this table after 24 hours (86400 seconds). These tombstones will first be written to an in-memory Memtable and then flushed to disk as an SSTable when the Memtable reaches a certain size. My question is when will the data that is now expired be removed from disk? Is it the next time the SSTable which contains the data gets compacted? So, with DTCS and min_threshold set to four, we would wait until at least three other SSTables are in the same time window as the expired data, and then those SSTables will be compacted into a SSTable. Is it during this compaction that the data will be removed? It seems to me that this would require Cassandra to maintain some metadata on which rows have been deleted since the newer tombstones would likely not be in the older SSTables that are being compacted.
Alternatively, do the SSTables which contain the tombstones have to be compacted with the SSTables which contain the expired data for the data to be removed? It seems to me that this could result in Cassandra holding the expired data long after it has expired since it's waiting for the new tombstones to be compacted with the older expired data.
Finally, I was also unsure when the tombstones themselves are removed. I know Cassandra does not delete them until after gc_grace_seconds but it can't delete the tombstones until it's sure the expired data has been deleted right? Otherwise it would see the expired data as being valid. Consequently, it seems to me that the question of when tombstones are deleted is intimately tied to the questions above. Thanks!
If it helps I've been experimenting with version 2.0.15 myself.
There are two ways to definitively remove data in Cassandra.
1: When gc_grace_seconds expires. In your table, gc_grace_seconds is set to 3600, which means that when you execute a delete statement on a row, you will have to wait 3600 seconds before the data can be definitively removed from the whole cluster.
2: When a compaction comes in. During a compaction, Cassandra looks for all the data marked with a tombstone and simply ignores it when writing the new SSTable, to ensure that the new SSTable doesn't contain already deleted data.
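How the two points interact can be sketched in Python. This is a toy model, not Cassandra's implementation: SSTables are plain dicts of key -> (timestamp, value), the timestamp doubles as the deletion time in seconds, and compact is a hypothetical function.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables into one, keeping only the newest cell per key."""
    merged = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)  # older, shadowed data is not rewritten
    # Tombstones survive the merge until they are older than gc_grace_seconds.
    return {
        key: (ts, value)
        for key, (ts, value) in merged.items()
        if not (value is TOMBSTONE and ts < now - gc_grace_seconds)
    }


old = {"a": (100, 1.0), "b": (100, 2.0)}
new = {"a": (200, TOMBSTONE)}  # "a" was deleted at t=200

# Within gc_grace_seconds: the value of "a" is gone, its tombstone remains.
early = compact([old, new], now=300, gc_grace_seconds=3600)
# After gc_grace_seconds: the tombstone is dropped as well.
late = compact([old, new], now=4000, gc_grace_seconds=3600)
print(sorted(early), sorted(late))
```

In this model the shadowed value disappears as soon as it is compacted together with its tombstone, while the tombstone itself is only dropped by a compaction that runs after gc_grace_seconds.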
However, it might happen that a node goes down for longer than gc_grace_seconds, or goes down during a compaction. You'll find more information on this in the Cassandra documentation.
After some further research and help from others I've realized that I had some misconceptions in my original questions. Specifically: "Data deleted by TTL isn’t the same as issuing a delete – each expiring cell internally has a ttl/timestamp at which it will be converted into a tombstone. There is no tombstone added to the memtable, or flushed to disk – it just treats the expired cells as tombstones once they’re past that timestamp."
Furthermore, Cassandra will check if it can drop SSTables containing only expired data when a memtable is flushed to disk and a minor compaction runs, no more than once every ten minutes though (see this issue). Hope that helps if you had the same questions as me!
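A minimal Python sketch of that idea, assuming each cell stores its write time and TTL and that expiry is decided at read or compaction time rather than by writing a tombstone. ExpiringCell and only_expired_data are hypothetical names, and as far as I understand the real check for dropping an SSTable has further conditions (such as gc_grace_seconds and overlapping SSTables) that are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpiringCell:
    write_time: int  # seconds since some epoch
    ttl: int         # seconds

    @property
    def expires_at(self) -> int:
        return self.write_time + self.ttl

    def is_live(self, now: int) -> bool:
        # No tombstone is written at expiry; the cell is simply treated
        # as a tombstone once now has passed expires_at.
        return now < self.expires_at


def only_expired_data(cells, now: int) -> bool:
    """True if every cell in this toy SSTable has expired."""
    return all(not cell.is_live(now) for cell in cells)


DAY = 86400  # default_time_to_live from the table above
sstable = [ExpiringCell(write_time=0, ttl=DAY),
           ExpiringCell(write_time=3600, ttl=DAY)]

print(only_expired_data(sstable, now=DAY))         # second cell still live
print(only_expired_data(sstable, now=DAY + 3600))  # everything expired
```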
I am always inserting data with PRIMARY KEY ((site_name,date),time,id). The site_name and date can be the same, but the time (a timestamp field) and the id (a uuid) are always different, so I only ever add new data. Data is inserted with a TTL (currently 3 days). Since I don't delete or update, can I disable compaction, considering the TTL is there? Would it affect anything? Also, as no record is deleted, can I disable the gc_grace time? I want to put as little load on the servers as possible. I'd much appreciate it if anyone can help.
TTLs create tombstones. As such, compaction is required. If your data is time series data, you might consider the new date tiered compaction: http://www.datastax.com/dev/blog/datetieredcompactionstrategy .
If you use TTLs and set grace to 0, you're asking for trouble unless your cluster is a single-node one. The grace is the amount of time to wait before collecting tombstones. If it's 0, it won't wait. This may sound good, but in reality it means the "deletion" might not propagate across the cluster, and the deleted data may reappear (because other nodes may still have it, and the last present value will "win"). This type of data is called zombie data. Zombies are bad. Don't feed the zombies.
You can disable auto compaction: http://www.datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsDisableAutoCompaction.html . But again, I doubt you'll gain much from this. Again, look at date tiered compaction.
You can permanently disable autocompaction on individual tables (column families), like this (CQL):
alter table <tablename> with compaction = { 'class':'CompactionStrategy', 'enabled':'false'}
Here 'CompactionStrategy' stands for the strategy class the table uses (for example SizeTieredCompactionStrategy). Setting 'enabled':'false' permanently disables autocompaction on that table, but you can still run a manual compaction whenever you like with the 'nodetool compact' command.
You can set gc grace to 0, but not turn off compaction. If you never delete or update I think you might be able to turn off compaction.
Edit:
Optimizations in C* from 2.0 and onwards exactly for this case:
https://issues.apache.org/jira/browse/CASSANDRA-4917
About TTL, tombstones and GC Grace
http://mail-archives.apache.org/mod_mbox/cassandra-user/201307.mbox/%3CCALY91SNy=cxvcGJh6Cp171GnyXv+EURe4uadso1Kgb4AyFo09g#mail.gmail.com%3E
In Cassandra 2.x when I delete one or multiple columns, they receive a tombstone in the Memtable but the data is not removed. At some point, the Memtable is flushed to an SSTable including the deleted data and the tombstone. When compaction is running, it will retain the tombstone with the specified grace period. What happens to the data? I have deleted a bunch of columns last week - less than gc_grace_seconds ago. I am not sure compaction has run yet. I haven't seen any change on disk size used yet, so I was wondering at which point is the data physically removed from disk?
In Cassandra 2.x when I delete one or multiple columns, they receive a tombstone in the Memtable but the data is not removed. At some point, the Memtable is flushed to an SSTable including the deleted data and the tombstone. When compaction is running, it will retain the tombstone with the specified grace period.
True.
What happens to the data?
The data will remain on disk for at least gc_grace_seconds. The next minor compaction right after gc_grace_seconds may remove it, but the real timing depends mostly on your dataset and workload type.
I have deleted a bunch of columns last week - less than gc_grace_seconds ago. I am not sure compaction has run yet. I haven't seen any change on disk size used yet, so I was wondering at which point is the data physically removed from disk?
If you want to free some disk space, you can:
wait for gc_grace_seconds to pass and let a normal minor compaction do the work, or
run nodetool compact, which triggers a major compaction on the current node and frees disk space right away.