Access Cassandra nodetool API programmatically

I need to provide utility functions similar to what is available through nodetool tablestats.
I've gone over its source code but didn't find a convenient way to access it from code.
Is there a library available for this?
https://github.com/mariusae/cassandra/blob/master/src/java/org/apache/cassandra/tools/NodeProbe.java

The nodetool utility connects to Cassandra via JMX and fetches all the necessary data from the corresponding MBeans. You can fetch the data you need from your own program via JMX as well, but I wouldn't say that this is the recommended way to do it - it's better to set up a "standard" monitoring solution such as Prometheus, connect it to Cassandra, and fetch the data through it...
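If you do decide to talk to JMX directly from code, a minimal sketch could look like the following. Assumptions: the node is reachable on the default JMX port 7199 with no authentication, the Cassandra version is recent enough to expose table metrics under org.apache.cassandra.metrics:type=Table (older versions use ColumnFamily), and the keyspace/table names are placeholders.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TableStatsProbe {
    public static void main(String[] args) throws Exception {
        // Assumption: default JMX port 7199, no authentication.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Per-table metric bean; "my_ks" and "my_table" are placeholders.
            ObjectName bean = new ObjectName(
                    "org.apache.cassandra.metrics:type=Table,keyspace=my_ks,scope=my_table,name=EstimatedPartitionCount");

            // Gauge metrics expose their current reading through the "Value" attribute.
            Object estimatedPartitions = mbs.getAttribute(bean, "Value");
            System.out.println("Estimated partitions: " + estimatedPartitions);
        }
    }
}
```

nodetool tablestats is essentially a formatted dump of a collection of beans like this one, so you can pick out only the attributes you need.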

Related

How to find if tables are being read using metadata query/logs in Cassandra

We are using DSE Cassandra v4.8.9. I have a requirement where I need to find out whether the tables in Cassandra are being read, using some metadata query or by analyzing the logs.
We can definitely find this out from the application, but that would require a code change and we don't have the cycles, so I thought of looking at ways to get these details using features of DSE/Cassandra if they are already available. Please advise.
You can obtain this information via JMX metrics - you need to look at the table metrics. You can read these metrics with a JMX client, or use the nodetool cfstats command (it was renamed to tablestats in later versions) - look at the local read & write counts...
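As a rough illustration of the JMX route (assuming a recent Cassandra where table metrics live under org.apache.cassandra.metrics:type=Table and the default JMX port 7199; older 2.x versions use type=ColumnFamily), you can enumerate the per-table read timers and see which tables are actually being read on that node:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TableReadCounts {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Match the ReadLatency timer of every table on this node.
            ObjectName pattern = new ObjectName(
                    "org.apache.cassandra.metrics:type=Table,keyspace=*,scope=*,name=ReadLatency");

            for (ObjectName name : mbs.queryNames(pattern, null)) {
                // The timer's "Count" attribute is the local read count that
                // nodetool cfstats/tablestats reports.
                Object reads = mbs.getAttribute(name, "Count");
                System.out.printf("%s.%s -> local reads: %s%n",
                        name.getKeyProperty("keyspace"),
                        name.getKeyProperty("scope"),
                        reads);
            }
        }
    }
}
```

Note that these counters are per node and reset on restart, so to cover the whole cluster you would run this against every node (or let a monitoring system aggregate it for you).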

Cassandra: query data on specific node

For teaching purposes I want to give insight into the replication strategy of a Cassandra cluster.
Therefore I would like to query the data on a specific Cassandra node, but I did not find a way to do this. Does one of you know a way to do this?
If you find where the data you want to query resides in the cluster, log into that node via a tool such as cqlsh and set your consistency to LOCAL_ONE; you should then be able to get the data from the local node only. If you want to prove that is the case, enable tracing before you run the query. It will tell you where it pulled the data from (by chance you could also hit some cases of read repair, which will show other nodes as well; if you do, ignore that run and do it again).
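If you want to drive the same check from code instead of cqlsh, here is a sketch with the DataStax Java driver (assuming driver 3.x; the contact point, keyspace, table, and query are placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class LocalOneTraceExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // LOCAL_ONE plus tracing: the trace shows which replica actually served the read.
            Statement stmt = new SimpleStatement(
                    "SELECT * FROM my_ks.my_table WHERE id = 42")   // placeholder query
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)
                    .enableTracing();

            ResultSet rs = session.execute(stmt);
            QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
            System.out.println("Coordinator: " + trace.getCoordinator());
            for (QueryTrace.Event event : trace.getEvents()) {
                System.out.println(event.getSource() + " : " + event.getDescription());
            }
        }
    }
}
```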
To know which node the data is coming from, I think you need to enable tracing in cqlsh.
cqlsh>TRACING ON
Once tracing is enabled, any query you run will return tracing details and information. For more details you may refer to the link below.
https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cqlsh_commands/cqlshTracing.html
The behaviour above depends on replication and the consistency level.
You can use https://github.com/tolbertam/sstable-tools to query a single SSTable through a cqlsh-like interface without going through Cassandra; there is a good example at https://github.com/tolbertam/sstable-tools#cqlsh. From there you can obtain a list of keys stored in that SSTable, and then you can run the regular cqlsh with TRACING ON, as mentioned in another answer, and see whether the query goes to that server or another.
Or you could stop all servers but one, and try to run queries against it with LOCAL_ONE.

Is it right to access external cache in apache spark applications?

We have many micro-services (Java), and data is written to a Hazelcast cache for better performance. Now the same data needs to be made available to a Spark application for data analysis. I am not sure whether accessing an external cache from Apache Spark is the right design approach. I cannot make database calls to get the data, as there would be many database hits, which might affect the micro-services (currently we don't have HTTP caching).
I thought about pushing the latest data into Kafka and reading it in Spark. However, the data (each message) might be big (> 1 MB sometimes), which is not ideal.
If it is OK to use an external cache in Apache Spark, is it better to use the Hazelcast client or to read the Hazelcast cached data over a REST service?
Also, please let me know if there is any other recommended way of sharing data between Apache Spark and micro-services.
Please let me know your thoughts. Thanks in advance.

Cassandra JMX: need to connect to all the nodes

I am trying to get Cassandra cfstats information from all the machines using JMX. This can be done using OpsCenter, but I do not want to use it, so I started building my own utility. For now, my Java program connects to JMX and fetches cfstats information such as estimated keys, number of SSTables, etc.
My requirement is: this will be a Java jar file that runs from one Cassandra node and should be able to connect to all the machines and fetch cfstats through each node's respective JMX.
I am planning to use the Java driver for this, as the Java driver can discover all the machines in the cluster using the system.peers column family. Once the Java driver connects to the machines, I will form the service:jmx:rmi URL using the respective hostname and JMX port (7199). Then I will be able to connect to NodeProbe using this information.
My question is: after connecting to another node using the Java driver, will I be able to retain state there, and after forming the service:jmx:rmi URL, will this URL really connect to that node's JMX and pull cfstats information from that node? (The JMX hostname is taken from cassandra-env.sh.) Can someone please help me with this?
Does this idea work, or is there another, better way to achieve this?
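Here is a rough sketch of what I have in mind (assuming the DataStax Java driver for peer discovery, the default JMX port 7199 on every node, and that each node's JMX is configured to listen on a remotely reachable address rather than localhost; the contact point is a placeholder):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ClusterWideStats {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            cluster.init();  // populates cluster metadata (system.peers) without opening a session

            for (Host host : cluster.getMetadata().getAllHosts()) {
                String address = host.getAddress().getHostAddress();
                JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://" + address + ":7199/jmxrmi");

                // Each connection talks to that node's own MBean server,
                // so anything fetched here is local to that node.
                try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    System.out.println(address + ": " + mbs.getMBeanCount() + " MBeans");
                    // ... fetch per-table metrics here, as in the other examples ...
                }
            }
        }
    }
}
```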
It's possible to use JMX remotely, but that's not the easiest thing to do.
But if you are writing your own tool, it may be worth checking out a different transport. For example, you can easily convert JMX calls to HTTP using Jolokia.
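For example, with the Jolokia JVM agent attached to each Cassandra node (default port 8778), a plain HTTP GET is enough to read an MBean attribute. In the sketch below, the node address, keyspace, and table names are placeholders, and the metric name assumes a recent Cassandra version:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class JolokiaTableStats {
    public static void main(String[] args) throws Exception {
        // Assumptions: Jolokia agent on its default port 8778; placeholder node and table names.
        String node = "10.0.0.1";
        String mbean = "org.apache.cassandra.metrics:type=Table,"
                + "keyspace=my_ks,scope=my_table,name=EstimatedPartitionCount";
        URL url = new URL("http://" + node + ":8778/jolokia/read/" + mbean + "/Value");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            // Jolokia responds with JSON, e.g. {"value":42,...}; parse it with any JSON library.
            System.out.println(body);
        }
    }
}
```

The response is JSON, so you avoid the RMI and firewall headaches of remote JMX entirely.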

How to calculate objects' size in Cassandra programmatically

I am new to Cassandra and was wondering how to calculate Cassandra cache size programmatically.
For example, after inserting several objects into Cassandra, I want to know, via code, how much space those objects take up in Cassandra's memtable.
cfstats is a command-line tool, which does not meet my requirement.
Is there anything in the Hector API that can help? Thanks.
The CLI tools actually use JMX to interrogate the Cassandra instance(s). You could use this approach programmatically, but it would be cumbersome. This page has some details on the monitoring interface:
http://www.datastax.com/docs/1.0/operations/monitoring
There is no other API support for retrieval of cache statistics information.
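If you do go the JMX route, a minimal sketch for a 1.0-era node could look like this (the keyspace/column family names are placeholders; the org.apache.cassandra.db:type=ColumnFamilies bean and the MemtableDataSize attribute are from that generation of Cassandra, while newer versions expose the equivalent data under org.apache.cassandra.metrics instead):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MemtableSizeProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Per-column-family bean in old (1.x) Cassandra; names are placeholders.
            ObjectName bean = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=MyKeyspace,columnfamily=MyCF");

            Object memtableBytes = mbs.getAttribute(bean, "MemtableDataSize");
            System.out.println("Current memtable size (bytes): " + memtableBytes);
        }
    }
}
```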
