For teaching purposes I want to give some insight into the replication strategy of a Cassandra cluster.
Therefore I would like to query the data held on a specific Cassandra node. I have not found a way to do this. Does anyone know how?
If you find out which node in the cluster holds the data you want to query, log into that node with a tool such as cqlsh, and set your consistency level to LOCAL_ONE, you should be able to get the data from the local node only. To prove that this is the case, enable tracing before you run the query; the trace will tell you which nodes the data was pulled from. By chance you may occasionally trigger a read repair, which will show other nodes in the trace as well. If that happens, ignore that run and try again.
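For classroom use, the idea of "where the data resides" can be sketched as a toy token ring. This is a simplified illustration only, assuming one token per node (no vnodes), SimpleStrategy-style placement, and an MD5 stand-in for the hash; the node names and the helper functions `token_for` and `replicas_for` are hypothetical. On a real cluster, `nodetool getendpoints <keyspace> <table> <key>` reports the actual replicas for a key.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    # Stand-in hash for illustration; Cassandra's default partitioner is Murmur3.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def replicas_for(key, ring, rf):
    """Return up to rf distinct nodes, walking clockwise from the key's token.

    ring is a list of (token, node) pairs sorted by token.
    """
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token; wrap around past the end.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    owners = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == rf:
            break
    return owners


# Hypothetical 4-node ring with evenly spaced tokens in a 64-bit space.
nodes = ["node1", "node2", "node3", "node4"]
ring = sorted((i * 2**62, name) for i, name in enumerate(nodes))
print(replicas_for("asdasdasd", ring, rf=3))
```

With a replication factor of 3 on four nodes, each key lands on three consecutive nodes of the ring, which is why a LOCAL_ONE read on one of those three can be served locally.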
To see which node the data is coming from, I think you need to enable tracing in cqlsh:
cqlsh> TRACING ON
Once tracing is enabled, any query you run will return detailed trace information. For more details, see the link below.
https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cqlsh_commands/cqlshTracing.html
Which nodes show up in the trace depends on the replication settings and the consistency level.
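The same check can be scripted. Below is a minimal sketch, assuming the DataStax Python driver (`cassandra-driver`) is installed and a node is reachable at the address you pass in; the helper names `sources_in_trace` and `trace_local_read` are made up for this example. It pins the driver to one coordinator, reads at LOCAL_ONE with tracing on, and counts trace events per source node.

```python
from collections import Counter
from types import SimpleNamespace


def sources_in_trace(events):
    """Count trace events per source node; each event needs a .source attribute."""
    return Counter(str(e.source) for e in events)


def trace_local_read(host, query):
    # Imported lazily so the helper above can be tried without the driver.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import WhiteListRoundRobinPolicy

    # Only ever use `host` as coordinator, and read at LOCAL_ONE.
    profile = ExecutionProfile(
        load_balancing_policy=WhiteListRoundRobinPolicy([host]),
        consistency_level=ConsistencyLevel.LOCAL_ONE,
    )
    cluster = Cluster([host], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    try:
        session = cluster.connect()
        result = session.execute(query, trace=True)
        return sources_in_trace(result.get_query_trace().events)
    finally:
        cluster.shutdown()


# Offline demonstration with fake events (the addresses are placeholders).
fake = [
    SimpleNamespace(source="10.0.0.1"),
    SimpleNamespace(source="10.0.0.1"),
    SimpleNamespace(source="10.0.0.2"),
]
print(sources_in_trace(fake))
```

If the coordinator is a replica for the key, the result of `trace_local_read` should normally contain only that node's address; a second address suggests the node is not a replica for that key or that a read repair was triggered.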
You can use https://github.com/tolbertam/sstable-tools to query a single SSTable through a cqlsh-like interface without going through Cassandra; there is a good example at https://github.com/tolbertam/sstable-tools#cqlsh. From there you can obtain a list of the keys stored in that SSTable, and then run the regular cqlsh with TRACING ON, as mentioned in another answer, to see whether the query goes to that server or to another one.
Or you could stop all servers but one, and try to run queries against it with LOCAL_ONE.
We are using DSE Cassandra v4.8.9. I need to find out whether the tables in Cassandra are being read, either through some metadata query or by analyzing the logs.
We could certainly find this out from the application, but that would require a code change and we don't have the cycles for it, so I thought of using DSE/Cassandra features to get these details, if any are available. Please advise.
You can obtain this information from JMX metrics: look at the table metrics. You can read them with a JMX client or with the nodetool cfstats command (renamed to tablestats in later versions); look at the local read and write counts.
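As a rough sketch of how those counts could be collected without touching the application, the snippet below parses text in the style of `nodetool cfstats` / `tablestats` output and lists tables with a local read count of zero. The sample text, keyspace and table names are invented for illustration, and the exact layout differs between versions, so the patterns are assumptions to adjust. Keep in mind that these counters are per node and, as far as I know, start from zero again when a node restarts, so check every node.

```python
import re

# Illustrative sample only; real output has many more fields per table.
SAMPLE = """
Keyspace: shop
    Read Count: 120
    Write Count: 45
        Table: orders
        Local read count: 120
        Local write count: 40
        Table: audit_log
        Local read count: 0
        Local write count: 5
"""


def local_read_counts(text):
    """Map 'keyspace.table' to its local read count."""
    counts = {}
    keyspace = table = None
    for line in text.splitlines():
        if m := re.match(r"\s*Keyspace\s*:\s*(\S+)", line):
            keyspace, table = m.group(1), None
        elif m := re.match(r"\s*(?:Table|Column Family)\s*:\s*(\S+)", line):
            table = m.group(1)
        elif (m := re.match(r"\s*Local read count\s*:\s*(\d+)", line)) and table:
            counts[f"{keyspace}.{table}"] = int(m.group(1))
    return counts


counts = local_read_counts(SAMPLE)
unread = [name for name, n in counts.items() if n == 0]
print(unread)
```

In practice the text would come from running nodetool on each node (for example via `subprocess.run`) instead of the hard-coded sample.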
I need to provide utility functions similar to what is available through
nodetool tablestats
I've gone over the source code but didn't find a convenient way to access this from code.
Is there a library available for this?
https://github.com/mariusae/cassandra/blob/master/src/java/org/apache/cassandra/tools/NodeProbe.java
The nodetool utility connects to Cassandra via JMX and fetches all the necessary data from the corresponding MBeans. You can fetch the same data from your own program via JMX as well, but I wouldn't say that this is the recommended way to do it. It's better to set up a "standard" monitoring solution such as Prometheus, connect it to Cassandra, and fetch the data through it.
Offsite backups for Cassandra seem like a challenging thing. You basically have to make yet another copy of ALL your data, including the copies of data that exist due to the replication factor. Snapshots make backups easy when you don't mind storing it on the same disk that your node already uses. I'm curious - in the event of a catastrophic failure of this disk, is it possible to recover the node using the nodes that the data was replicated to?
Yes, you can restore the data on a crashed node using the procedure in the documentation: Replacing a dead node or dead seed node. The linked page is for Cassandra 3.x; please pick your Cassandra version from the drop-down menu at the top of the page.
But please note that you still need to make backups if your data is valuable. If you are using AWS, you can use this project to back up Cassandra to S3 storage.
If you are looking for offsite or off-host backups, you can also look at OpsCenter from DataStax or Talena software (my company). Both let you back up your database locally or to S3. As you may expect, you can also restore data after hardware failures, user errors, or logical corruption, which the replicas will not protect you against.
Yes, it is possible. Just run "nodetool repair" in a terminal on the node with the missing data. It can take a long time. I would also recommend running a repair on each node every month to keep your data fully replicated, because Cassandra does not repair data automatically (for example, after one or more nodes go down).
I am trying to get all inserts, updates, deletes to a normalized DB2 database (hosted on an IBM Mainframe) synced to a Cassandra database. I also need to denormalize these changes before I write them to Cassandra so that the data structure meets my Cassandra model.
I searched on Google, but the tools I found lack either processing support or streaming CDC support.
Is there any tool out there that can help me achieve the above?
It's likely that no stock tool exists. What's the format of the CDC stream coming out? What queries do you need to run? Like any other Cassandra data modeling question, start with the queries you need to run and work backwards to the table structure(s).
I'm new to Cassandra, and I have a very basic question: how many parallel queries can I run without compromising performance? The queries are going to look like this:
Select data from table where id='asdasdasd';
It's a server in a datacenter. Should it work properly with 3000 read queries? Sorry for the sparse information, but it's all I have.
It all depends on the capacity of the servers your Cassandra cluster is installed on, and on how you have configured the nodes.
There is a configuration parameter in cassandra.yaml called concurrent_reads.
Tune it to get a better read rate.
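Since the answer depends on your hardware, the practical approach is to measure. The sketch below shows one way to compare client-side concurrency levels; `fake_query` is a stand-in that just sleeps, and you would replace it with a real driver call against your cluster. The function name `measure` and the numbers used are assumptions for illustration. For the server side, the comments in the stock cassandra.yaml suggest roughly 16 × the number of data drives as a starting point for concurrent_reads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(_):
    # Stand-in for a real read, e.g. a driver call executing the SELECT above.
    time.sleep(0.001)
    return 1


def measure(run_query, n_queries, concurrency):
    """Run n_queries with at most `concurrency` in flight; return (completed, qps)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        done = sum(pool.map(run_query, range(n_queries)))
    elapsed = time.perf_counter() - start
    return done, done / elapsed


for c in (1, 8, 32):
    done, qps = measure(fake_query, 200, c)
    print(f"concurrency={c:>3} completed={done} qps={qps:.0f}")
```

Raise the concurrency until throughput stops improving or latency becomes unacceptable; that knee is the practical limit for your setup.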