I'm using the DataStax Cassandra Java Driver 2.1.0 to delete a set of rows in the database. My test environment is a single node running Cassandra 2.0.7.
I run the delete statement and then check the result by running a query that selects the deleted rows.
The problem is that the second query still returns the rows, yet if I check via cqlsh, the rows are indeed deleted.
The query trace reports that the rows are marked as tombstoned, so why does the select query retrieve the data anyway?
Here is the code for the delete task:
// Delete every row whose partition key is in rowKeyArray
Statement query = QueryBuilder.delete().from(QueryBuilder.quote(CF_MESSAGES))
        .where(QueryBuilder.in(CF_MESSAGES_KEY, (Object[]) rowKeyArray));
session.execute(query);
And here is the code for the select:
// Read the same rows back to verify that the delete took effect
query = QueryBuilder.select().all().from(QueryBuilder.quote(CF_MESSAGES))
        .where(QueryBuilder.in(CF_MESSAGES_KEY, (Object[]) rowKeyArray))
        .and(QueryBuilder.lte(CF_MESSAGES_COLUMN1, "2:" + Character.MAX_VALUE));
ResultSet queryResult = session.execute(query);
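For completeness, this is how I capture the trace from the driver (a minimal sketch using the same names as above; the 2.1 API exposes the trace through ExecutionInfo):
// Trace the verification read so the event log shows live vs. tombstoned cells
Statement traced = QueryBuilder.select().all().from(QueryBuilder.quote(CF_MESSAGES))
        .where(QueryBuilder.in(CF_MESSAGES_KEY, (Object[]) rowKeyArray))
        .enableTracing();
ResultSet rs = session.execute(traced);
QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
for (QueryTrace.Event event : trace.getEvents()) {
    System.out.println(event.getDescription()); // e.g. "Read 0 live and 2 tombstone cells"
}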
Thank you!
Repair is an anti-entropy mechanism that should be run roughly weekly, or at least more often than your gc_grace_seconds, in order to keep deleted data from coming back as zombies. DataStax OpsCenter has a Repair Service that automates this task.
Manually you can run:
nodetool repair
on a single node, or
nodetool repair -pr
on each of your nodes. The -pr option ensures you only repair each node's primary ranges.
You should try a nodetool repair; I had the same issue:
Cassandra and defuncting connection (see comments)
If your environment is a test environment, try reducing gc_grace_seconds, and check whether any node was down when the delete happened (you can check with the Linux uptime command).
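Lowering gc_grace_seconds is a one-line schema change; a minimal sketch from the driver, with a hypothetical keyspace and table name (on a multi-node cluster, keep in mind that a low value narrows the window in which repair must run):
// Hypothetical names; tombstones become purgeable after one hour
session.execute("ALTER TABLE mykeyspace.messages WITH gc_grace_seconds = 3600;");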
I'm restoring deleted table data from a Cassandra snapshot.
This is what I'm doing:
Logging in to the first node.
Taking all the files from the snapshot and copying them to the table's data directory.
Executing
nodetool refresh <keyspace> <table>
and the data shows up fine on that node, but it is not propagated to the others in the cluster.
I'm aware the reason may be related to the timestamps on the records, so following advice I tried deleting the data in the tables first, executing
TRUNCATE <table>
on that node before repeating the process, but with the same results.
Have you tried a nodetool repair -full on the other nodes?
You will need to run the same refresh process on all the nodes; the table's data should appear on every node after that.
I know this is a quick and dirty solution, but when I faced this problem my fix was:
COPY Usuario (id, usuarioId, organizacionId, descripcion, estado, ultimoCambio, json, sesion) TO 'Usuario.csv';
COPY Usuario (id, usuarioId, organizacionId, descripcion, estado, ultimoCambio, json, sesion) FROM 'Usuario.csv';
And I prefer backing up these CSV files over snapshots. With this process, the rows are recreated and correctly replicated to every node in the cluster.
Currently we have two options to back up the tables in a Cassandra keyspace: we can either use nodetool commands or the COPY command from the cqlsh terminal.
1) What are the differences between these commands?
2) Which one is most appropriate?
3) Also, if we are using nodetool to take a backup, we would generally flush the data from memtables to SSTables before issuing the nodetool snapshot command. So my question is: should we employ the same technique of flushing the data if we use the cqlsh COPY command?
Any help is appreciated.
Thanks very much.
GREAT question!
1) What are the differences between these commands?
Running a nodetool snapshot creates hard links to the SSTable files of the requested keyspace. It's the same as running this from the (Linux) command line:
ln {source} {link}
A cqlsh COPY is essentially the same as doing a SELECT * FROM on a table. It'll create a text file with the table's data in whichever format you have specified.
In terms of their difference from a backup context, a file created using cqlsh COPY will contain data from all nodes, whereas nodetool snapshot needs to be run on each node in the cluster. In clusters where the number of nodes is greater than the replication factor, each snapshot will only be valid for the node on which it was taken.
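To illustrate the SELECT * equivalence, here is a rough sketch of what COPY ... TO does, written with the Java driver used earlier in this thread (contact point, keyspace, file name, and column types are assumptions):
import com.datastax.driver.core.*;
import java.io.PrintWriter;

public class CopyToSketch {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect();
             PrintWriter writer = new PrintWriter("usuario.csv")) {
            // COPY TO is essentially a paged full-table scan written out as text
            Statement stmt = new SimpleStatement("SELECT * FROM mykeyspace.usuario");
            stmt.setFetchSize(1000); // stream the table page by page
            for (Row row : session.execute(stmt)) {
                // Assumes id is a uuid column and descripcion is text
                writer.println(row.getUUID("id") + "," + row.getString("descripcion"));
            }
        }
    }
}
Because it is a normal read, it returns data from across the cluster at the consistency level you request, which is why a single COPY file can cover all nodes.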
2) Which one is most appropriate?
It depends on what you're trying to do. If you simply need backups for a node/cluster, then nodetool snapshot is the way to go. If you're trying to export/import data into a new table or cluster, then COPY is the better approach.
Also worth noting, cqlsh COPY takes a while to run (depending on the amount of data in a table), and can be subject to timeouts if not properly configured. nodetool snapshot is nigh instantaneous; although the process of compressing and SCPing snapshot files to an off-cluster instance will take some time.
3) Should we employ the same technique of flushing the data if we use the cqlsh COPY command?
No, that's not necessary. As cqlsh COPY works just like a SELECT, it follows the normal Cassandra read path, which checks structures both in RAM and on disk. (An explicit flush is not needed before a snapshot either: nodetool snapshot flushes memtables to disk before creating the hard links.)
nodetool snapshot is a good approach for any amount of data, and it creates the hard links within seconds. The COPY command takes much longer, depending on the size of the data and of the cluster. For small data sets and testing you may use COPY, but for production nodetool snapshot is recommended.
My OpsCenter gives me a 'Failed' result on the Tombstone count performance service. I read this paper and found that the insertion of NULL values may be the cause.
So I tried to fix the problem using the following procedure:
Set the NULL columns of the tables channels and articles to ''. For verification purposes, there are no further inserts into these two tables.
Set gc_grace_seconds to 0 using the commands:
alter table channels with gc_grace_seconds = 0;
alter table articles with gc_grace_seconds = 0;
Truncate the bestpractice_results table in the OpsCenter keyspace.
Restart the agents and OpsCenter using the commands:
service datastax-agent restart
service opscenterd restart
But when OpsCenter runs its routine performance check (every minute), the same 'Failed' message appears again, and the number of tombstones is unchanged (i.e., 23552 and 1374).
So I have these questions:
How can I remove these tombstones when there are no insert operations on the two tables?
Do I need to repair the cluster?
OpsCenter version: 6.0.3. Cassandra version: 2.1.15.1423. DataStax Enterprise version: 4.8.10.
With Cassandra 3.10+, use
nodetool garbagecollect keyspace_name table_name
Check https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsGarbageCollect.html
Please go through the link below for complete information about deletes and tombstones. It may be helpful for you.
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
I've created a table called 'test1'.
CREATE TABLE test1(
link text PRIMARY KEY,
title text,
descp text,
pubdate text,
ts timestamp
);
Then I insert a record into:
INSERT INTO test1(title,link,descp,pubdate, ts) VALUES('T3','http://link.com/a3','D3','date3', toTimestamp(now())) IF NOT EXISTS;
This results in an error (red text in cqlsh): NoHostAvailable
The setup uses Cassandra version 3.9 on Mac OS El Capitan.
The keyspace is this:
CREATE KEYSPACE testkeyspace
WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
These are the configuration parameters I changed according to answers on Stack Overflow:
start_rpc: true (from false)
start_native_transport: true (from false)
Still, I can't seem to pinpoint why I can't run this INSERT statement with the "IF NOT EXISTS" keywords at the end.
Note that I started Cassandra using "cassandra -f"
Please help if you know what's wrong here.
A possibility to try: many of the drivers now use LOCAL_ONE as the default consistency level. With SimpleStrategy you can hit cases where, even with all nodes up, the request can fail (CASSANDRA-12053) if none of the nodes in your local DC has the data. That should surface as an UnavailableException rather than a NoHostAvailable, but it's worth trying a NetworkTopologyStrategy replication setup instead.
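If the inserts come from an application rather than cqlsh, the consistency levels can also be pinned per statement; a hedged sketch with the Java driver (this alone won't help while the replication factor exceeds the node count, but it makes the levels explicit):
// LWT inserts use two levels: SERIAL for the Paxos phase plus the
// regular consistency level for the commit; both can be set explicitly
Statement insert = new SimpleStatement(
        "INSERT INTO testkeyspace.test1 (title, link, descp, pubdate, ts) " +
        "VALUES ('T3', 'http://link.com/a3', 'D3', 'date3', toTimestamp(now())) " +
        "IF NOT EXISTS");
insert.setConsistencyLevel(ConsistencyLevel.ONE);
insert.setSerialConsistencyLevel(ConsistencyLevel.SERIAL);
session.execute(insert);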
Is this a one-node cluster? Having a replication factor greater than the number of nodes will doubtless cause issues, so per the comment, setting it to 1 is a good idea.
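For reference, that change is a single statement (keyspace name from the question); shown here via the driver, though the CQL inside the string works as-is in cqlsh:
// Single-node cluster: make the replication factor match the node count
session.execute("ALTER KEYSPACE testkeyspace WITH replication = " +
        "{'class': 'SimpleStrategy', 'replication_factor': 1};");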
Was Cassandra still running at the time of the query? With -f you need to keep Cassandra running in the foreground, or cqlsh will lose its connection and give a NoHostAvailable error.
I have an eight node Cassandra setup. I save data with a 3-day TTL, but the data is useless after I take a summary (using my Java script: counts of things, etc.). I want to delete all the data in a table, and I can stop Cassandra for some time to do the deletion, so that the data is removed from all nodes.
Should I run truncate and then nodetool repair afterwards, or should I flush first and then delete? What's the proper way to do it?
You can drop the tables or truncate them, but keep in mind that Cassandra snapshots your tables by default on a drop or truncate, so you'll also need to run nodetool clearsnapshot on all of your nodes afterwards. There is no need to stop Cassandra while you do this delete.
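If you drive the cleanup from the same Java job that builds the summary, the truncate itself is just a statement; a minimal sketch with hypothetical names (the snapshot cleanup still happens via nodetool on each node):
// TRUNCATE removes all rows at once without writing per-row tombstones,
// but auto_snapshot keeps a pre-truncate snapshot that must be cleared per node
session.execute("TRUNCATE mykeyspace.mytable;");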
I don't know that there is a right way per se, but what I do when I need to clear a table is: first, I run truncate on the table using cqlsh; then I run nodetool clearsnapshot on my nodes using pssh (https://code.google.com/p/parallel-ssh/).
Hope this helps