Cassandra truncate performance - cassandra

I have been recently told that, cassandra truncate is not performant and it is anti pattern. But, I do not know why?
So, I have 2 questions:
Is it more performant to have upsert of all records then doing truncate?
Does truncate operation creates tombstones?
Cassandra Version: 3.x

From the cassandra docs:
Note: TRUNCATE sends a JMX command to all nodes, telling them to
delete SSTables that hold the data from the specified table. If any of
these nodes is down or doesn't respond, the command fails and outputs
a message like the following
So, running truncate will issue a deletion of all sstables belonging to your cassandra table, which will be quite fast but must be acknowledged by all nodes. Depending on your cassandra.yml this will snapshot your data before:
auto_snapshot (Default: true) Enable or disable whether a snapshot is
taken of the data before keyspace truncation or dropping of tables. To
prevent data loss, using the default setting is strongly advised. If
you set to false, you will lose data on truncation or drop.
When creating or modifying tables, you enable or disable the key cache
(partition key cache) or row cache for that table by setting the
caching parameter. Other row and key cache tuning and configuration
options are set at the global (node) level. Cassandra uses these
settings to automatically distribute memory for each table on the node
based on the overall workload and specific table usage. You can also
configure the save periods for these caches globally.
To your question:
upserts will be much slower (when there is significant data in your table)
truncate does not write tombstones at all (instead it will delete all on all nodes for your truncated table sstables immediately)

Related

cassandra deletes best practices

Looking to reclaim space on a large table. The table has old data which is no longer required and can be deleted. The deletes are based on partition key, there are about 500k partition keys to be deleted.
Would it be better to run the deletes in batches say 50k or 100k in one go? what might be a better batch size (batch here implying how many deletes can be run in one go)?
If the deletes are being run from cqlsh, will cqlsh act as client and connect to diff nodes as coordinator node for each delete or will the node from where cqlsh is started acts as co-ordinator node and all the deletes fired from there?
what are the best practices to run massive deletes/cleanups? any specific dos and donts?
First thing that you need to remember in Cassandra is that deletes really increase disk consumption, not decreasing it, until the compaction happens and old data is deleted. The Last Pickle has a great blog post on that topic.
Regarding your questions:
Batches on different partition keys are heavily increasing a pressure onto coordinator node, so they aren't recommended, especially such big. Prefer to delete one by one
cqlsh always sends commands to the same host (this is enforced by WhiteListPolicy) that acts as coordinator that then forwards traffic to node owning that data.
I would recommend to use external tool, either Spark + Spark Cassandra Connector, or you can use DSBulk to perform deletes as well, by using a custom query, something like this (assuming that you have CSV file with all values for partition column(s) that you want to delete - :pk the name of the column in the header of CSV file, and pk - name of partition column in your schema):
dsbulk load -query "DELETE FROM ks.table WHERE pk = :pk"
In this case DSBulk will correctly send data directly to nodes owning the data, avoiding the pressure on coordinator node.

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doning the extraction we iterate the table 10 times for every username to make sure we try all versions of entry_type.
We can no longer find any entries as we have depleted our list of usernames. But our nodetool tablestats report that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out what usernames remains in the table. If I could inspect it I could add the usernames left in the table to our extraction job and eventually we could deplete the table. But I cannot simply query the table as such:
SELECT * FROM ${keyspace}.entries LIMIT 1
as cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?
As per the comment, the migration process includes a DELETE operation from the Cassandra table, but the engine will have a delay before actually removing from disk the affected records; this process is controlled internally with tombstones and the gc_grace_seconds attribute of the table. The reason for this delay is fully explained in this blog entry, for a tl dr, if the default value is still in place, Cassandra will need to pass at least 10 days (864,000 seconds) from the execution of the delete before the actual removal of the data.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table, in the example, it will set it to 1 minute, while the default is
ALTER TABLE .entries with GC_GRACE_SECONDS = 60;
Manually compact the table:
nodetool compact entries
Once that the process is completed, nodetool tablestats should be up to date
To answer your first question, I would like to put more light on gc_grace_seconds property.
In Cassandra, data isn’t deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput, and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A “tombstone” marker is written to indicate that the data is now (logically) deleted (also known as soft delete). Records marked tombstoned must be removed to claim back the storage space. Which is done by a process called Compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog to read more in detail : https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Now possibly you are looking into table size before gc_grace_seconds and data is still there.
Coming to your second issue where you want to fetch some samples from the table without providing partition keys. You can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze database data. You can follow the articles / documentation to write a quick handy spark application to analyze Cassandra data.
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
I would recommend not to delete records while you do the migration. Rather first complete the migration and post that do a quick validation / verification to ensure all records are migrated successfully (this use can easily do using Spark buy comparing dataframes from old and new tables). Post successful verification truncate the old table as truncate does not create tombstones and hence more efficient. Note that huge no of tombstone is not good for cluster health.

Cassandra - What is difference between TTL at table and inserting data with TTL

I have a Cassandra 2.1 cluster where we insert data though Java with TTL as the requirement of persisting the data is 30 days.
But this causes problem as the files with old data with tombstones is kept on the disk. This results in disk space being occupied by data which is not required. Repairs take a lot of time to clear this data (upto 3 days on a single node)
Is there a better way to delete the data?
I have come across this on datastax
Cassandra allows you to set a default_time_to_live property for an entire table. Columns and rows marked with regular TTLs are processed as described above; but when a record exceeds the table-level TTL, Cassandra deletes it immediately, without tombstoning or compaction. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html?hl=tombstone
Will the data be deleted more efficiently if I set TTL at table level instead of setting each time while inserting.
Also, documentation is for Cassandra 3, so will I have to upgrade to newer version to get any benefits?
Setting default_time_to_live applies the default ttl to all rows and columns in your table - and if no individual ttl is set (and cassandra has correct ntp time on all nodes), cassandra can easily drop those data safely.
But keep some things in mind: your application is still able so set a specific ttl for a single row in your table - then normal processing will apply. On top, even if the data is ttled it won't get deleted immediately - sstables are still immutable, but tombstones will be dropped during compaction.
What could help you really a lot - just guessing - would be an appropriate compaction strategy:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/dml/dmlHowDataMaintain.html#dmlHowDataMaintain__twcs-compaction
TimeWindowCompactionStrategy (TWCS)
Recommended for time series and expiring TTL workloads.
The TimeWindowCompactionStrategy (TWCS) is similar to DTCS with
simpler settings. TWCS groups SSTables using a series of time windows.
During compaction, TWCS applies STCS to uncompacted SSTables in the
most recent time window. At the end of a time window, TWCS compacts
all SSTables that fall into that time window into a single SSTable
based on the SSTable maximum timestamp. Once the major compaction for
a time window is completed, no further compaction of the data will
ever occur. The process starts over with the SSTables written in the
next time window.
This help a lot - when choosing your time windows correctly. All data in the last compacted sstable will have roughly equal ttl values (hint: don't do out-of-order inserts or manual ttls!). Cassandra keeps the youngest ttl value in the sstable metadata and when that time has passed cassandra simply deletes the entire table as all data is now obsolete. No need for compaction.
How do you run your repair? Incremental? Full? Reaper? How big in terms of nodes and data is your cluster?
The quick answer is yes. The way it is implemented is by deleting the SStable/s directly from disk. Deleting an SStable without the need to compact will clear up disk space faster. But you need to be sure that the all the data in a specific sstable is "older" than the globally configured TTL for the table.
This is the feature referred to in the paragraph you quoted. It was implemented for Cassandra 2.0 so it should be part of 2.1

Truncating the data create the snapshot in Cassandra consuming unnecessary space

I had a column family containing the space of 40GB. I truncated the column family. So, after GC_GRACE_SECONDS Cassandra created snapshots of the truncated data which is consuming the same amount of space. Is there any way by means of which we can get rid of space utilized by snapshots without disable the creation of snapshots. I mean isn't there any timeout parameter after which it will delete the snapshot consuming unnecessary space.
The snapshot that you are seeing getting created after truncating the column family is actually a safety mechanism in C* to avoid mass data loss in case of accidental table delete or truncation (btw it has nothing to do with gc_grace_seconds). There is a setting 'auto_snapshot' in cassandra.yaml which is true by default. From DataStax documentation
auto_snapshot
(Default: true ) Enable or disable whether a snapshot is taken of the data before keyspace truncation or dropping of tables. To prevent data loss, using the default setting is strongly advised. If you set to false, you will lose data on truncation or drop.
If you want to delete snapshots then you can use nodetool clearsnapshot command as explained here

Cassandra data directory does not get updated with deletion

Currently, I am bench marking Cassandra database using YCSB framework. During this time I have performed (batch) insertion and deletion of the data quite regularly.
I am using Truncate command to delete keyspace rows. However, I am noticing that my Cassandra data directory swells up as the experiments.
I have checked and can confirm that even there is no data in the keystore when I checked the size of data directory. Is there a way to initialize a process so that Cassandra automatically release the stored space, or does it happen over time.
When you use Truncate cassandra will create snapshots of your data.
To disable it you will have to set auto_snapshot: false in cassandra.yaml file.
If you are using Delete, then cassandra use tombstone,i.e your data will not get deleted immediately. Data will get deleted once compaction is ran.
To remove previous snapshots one can use nodetool snapshot command.

Resources