How does Cassandra 4.0 read virtual table data?

From the documentation it is clear that in Cassandra 4.0 virtual tables are read-only and no writes are allowed.
Currently there are 2 virtual keyspaces available, i.e. system_views and system_virtual_schema, and together they contain 17 tables.
These contain data like clients, caches, settings, etc.
Where exactly is the data in these virtual tables coming from?
Here are all vtables: https://github.com/apache/cassandra/tree/64b338cbbce6bba70bda696250f3ccf4931b2808/src/java/org/apache/cassandra/db/virtual
PS: I have gone through cassandra.yaml
Reference : https://cassandra.apache.org/doc/latest/new/virtualtables.html

The virtual tables expose metrics and metadata that were previously only available via JMX but are now also available via CQL.
For example, the system_views.clients table tracks metadata on client connections, including (but not limited to):
the remote IP address of the client
logged in user (if auth is enabled)
protocol version
driver name & version
whether SSL is used, etc.
This data is available via JMX and nodetool clientstats, and is now retrievable via CQL (I wrote about this in https://community.datastax.com/questions/6113/). Cheers!
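To see it from an application, you can query the table like any other; here is a minimal sketch with the 4.x Java driver, assuming a local node on the default port:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

// connects to 127.0.0.1:9042 by default; system_views.clients has one
// row per client connection, served from the node's in-memory state
// rather than from SSTables on disk
try (CqlSession session = CqlSession.builder().build()) {
    for (Row row : session.execute(
            "SELECT address, username, driver_name, driver_version FROM system_views.clients")) {
        System.out.printf("%s %s %s %s%n",
                row.getInetAddress("address"),
                row.getString("username"),
                row.getString("driver_name"),
                row.getString("driver_version"));
    }
}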

Related

Which node will respond to "SELECT * FROM system.local" using the Cassandra Java driver?

I am trying to write some synchronization code for a Java app that runs on each of the Cassandra servers in our cluster (so each server has 1 Cassandra instance + our app). For this I wanted to write a method that returns the 'local' Cassandra node, using the Java driver.
Every process creates a CqlSession using the local address as the contact point. The driver will figure out the rest of the cluster from that. But my assumption was that the local address would be its 'primary' node, at least for requests against the system.local table. This seems not to be the case when running the code.
Is there a way in the Java driver to determine which of the x nodes the process is running on?
I tried this code:
import java.util.Map;
import java.util.UUID;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.metadata.Metadata;
import com.datastax.oss.driver.api.core.metadata.Node;

public static Node getLocalNode(CqlSession cqlSession) {
    Metadata metadata = cqlSession.getMetadata();
    Map<UUID, Node> allNodes = metadata.getNodes();
    // system.local is answered by whichever node coordinates this query,
    // not necessarily the node the app is running on
    Row row = cqlSession.execute("SELECT host_id FROM system.local").one();
    UUID localUUID = row.getUuid("host_id");
    for (Node node : allNodes.values()) {
        if (localUUID.equals(node.getHostId())) {
            return node;
        }
    }
    return null;
}
But it seems to return random nodes, which makes sense if the query is just sent to one of the nodes in the cluster. I was hoping to find a way to determine which node the app is running on without providing hardcoded configuration.
"my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This seems not so, when trying to run the code."
Correct. When running a query where token range ownership cannot be determined, a coordinator is "selected." There is a random component to that selection. But it does take things like network distance and resource utilization into account.
I'm going to advise reading the driver documentation on Load Balancing. This does a great job of explaining how the load balancing policies work with the newer drivers (>= 4.10).
In that doc you will find that query routing plans:
are different for each query, in order to balance the load across the cluster;
only contain nodes that are known to be able to process queries, i.e. neither ignored nor down;
favor local nodes over remote ones.
As far as being able to tell which apps are connected to which nodes, try using the execution information returned by the result set. You should be able to get the coordinator's endpoint and hostId that way.
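// the coordinator that actually served this request is exposed via the
// result set's ExecutionInfo, alongside the node's own host_id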
ResultSet rs = session.execute("select host_id from system.local");
Row row = rs.one();
System.out.println(row.getUuid("host_id"));
System.out.println();
System.out.println(rs.getExecutionInfo().getCoordinator());
Output:
9788de64-08ee-4ab6-86a6-fdf387a9e4a2
Node(endPoint=/127.0.0.1:9042, hostId=9788de64-08ee-4ab6-86a6-fdf387a9e4a2, hashCode=2625653a)
You are correct. The Java driver connects to random nodes by design.
The Cassandra drivers (including the Java driver) are configured with a load-balancing policy (LBP) which determines which nodes the driver contacts, and in which order, when it runs a query against the cluster.
In your case, you didn't configure a load-balancing policy so it defaults to the DefaultLoadBalancingPolicy. The default policy calculates a query plan (list of nodes to contact) for every single query so each plan is different across queries.
The default policy gets a list of available nodes (down or unresponsive nodes are not included in the query plan) and "prioritises" replicas (the nodes which own the data) in the local DC over non-replicas, meaning replicas will be contacted as coordinators before other nodes. If there are 2 or more replicas available, they are ordered "healthiest" first. The list in the query plan is also shuffled for randomness so the driver avoids contacting the same node(s) all the time.
Hopefully this clarifies why your app doesn't always hit the "local" node. For more details on how it works, see Load balancing with the Java driver.
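For completeness, the main lever you have over node selection with the 4.x driver is the local datacenter; a minimal sketch, where the contact point and DC name are hypothetical:

import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;

// the default policy will only use nodes in this datacenter as coordinators
CqlSession session = CqlSession.builder()
        .addContactPoint(new InetSocketAddress("10.0.0.1", 9042))
        .withLocalDatacenter("dc1")
        .build();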
I gather from your post that you want to circumvent the built-in load-balancing behaviour of the driver. It seems like you have an edge case that I haven't come across, and I'm not sure what outcome you're after. If you tell us what problem you are trying to solve, we might be able to provide a better answer. Cheers!

How to change a property key in Gremlin with JanusGraph as graph engine and Cassandra as storage backend?

I am using Cassandra 4.0 with JanusGraph 0.6. I have n nodes with label "januslabel" and property "janusproperty", and I want to change the property name to "myproperty". I have tried the answer in this link, Rename property with Gremlin in Azure Cosmos DB, but I was not able to do it permanently. What I mean by permanently is that whenever I restart Cassandra or JanusGraph I get the old property name, "janusproperty".
How can I change this permanently?
When using JanusGraph, if no transaction is currently open, one will be automatically started once a Gremlin query is issued. Subsequent queries are also part of that transaction. Transactions need to be explicitly committed for any changes to be persisted using something like graph.tx().commit(). Transactions that are not committed will eventually time out and changes will be lost.
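For illustration, a minimal sketch of the rename with an explicit commit, assuming the stock CQL config file shipped with the JanusGraph distribution (the label and property names mirror the question):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cql.properties");
GraphTraversalSource g = graph.traversal();

// copy each old property value to the new key, then remove the old property
g.V().hasLabel("januslabel").has("janusproperty").forEachRemaining(v -> {
    v.property("myproperty", v.value("janusproperty"));
    v.property("janusproperty").remove();
});

// without this commit the rename only lives in the open transaction and
// is gone after a restart or transaction timeout
graph.tx().commit();
graph.close();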

TokenAware Policy in Cassandra

When using the token-aware policy as the load balancing policy in Cassandra, are all queries automatically sent to the correct node (the one which contains the replica, e.g. for select * from Table where partitionkey = something, will the driver automatically compute the hash and go to the correct replica), or do I have to use the token() function with all my queries?
That is correct. The TokenAwarePolicy will allow the driver to prefer a replica for the given partition key as the coordinator for the request if possible.
Additional information about load balancing with the Java driver is available on the LoadBalancingPolicy API page.
Specifically, the API documentation for TokenAwarePolicy is available here.
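A minimal sketch of what this looks like in practice with the 3.x Java driver (an existing session and a hypothetical mykeyspace.mytable assumed); note that no token() function appears in the query, because the driver derives the routing key from the bound partition key of the prepared statement:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;

// the driver reads the routing key from the bound partition key,
// computes the token itself, and prefers a replica as coordinator
PreparedStatement ps = session.prepare(
        "SELECT * FROM mykeyspace.mytable WHERE partitionkey = ?");
Row row = session.execute(ps.bind("something")).one();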

How to handle read/write requests in Cassandra

I have a 5 node cluster with 2 Cassandra, 2 Solr and 1 Hadoop node on EC2 with DSE 4.5.
My requirement is that I don't want to hard-code node IP addresses when issuing reads/writes against the cluster. I have to develop a web service through which a requester can send read/write requests to my cluster, and the web service has to determine the following:
1) route read requests to the appropriate node;
2) route write requests to the appropriate node.
If there is a write request then it should be directed to a Cassandra node on the basis of keyspace and replication factor. If it is a read request then it should route to a Solr node (as I have done my indexing in Solr), and if it is an analytics query then it should route to Hadoop.
And if any node goes down, responses should not be affected.
Apart from dedicated requests, is there any way to address the cluster as a whole?
By dedicated I mean giving a specific IP address for reads and writes.
Does any method or algorithm exist in DSE for this? Or is there any tool available for it?
The Java driver should take care of all of that for you:
http://www.datastax.com/documentation/developer/java-driver/2.0/common/drivers/introduction/introArchOverview_c.html
For example:
Nodes discovery: the driver automatically discovers and uses all nodes of the Cassandra cluster, including newly bootstrapped ones
Configurable load balancing: the driver allows for custom routing and load balancing of queries to Cassandra nodes. Out of the box, round robin is provided with optional data-center awareness (only nodes from the local data-center are queried (and have connections maintained to)) and optional token awareness (that is, the ability to prefer a replica for the query as coordinator).
Transparent failover: if Cassandra nodes fail or become unreachable, the driver automatically and transparently tries other nodes and schedules reconnection to the dead nodes in the background.
On the Solr query side, you can use the SolrJ load balancer, but you have to hard-wire the list of nodes to be used as coordinator nodes; SolrJ will then round-robin across them for you.
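On the Cassandra side, a minimal sketch of such a driver setup with the 2.x Java driver that was current for DSE 4.5 (the seed IPs are hypothetical, and "Cassandra" is the default datacenter name DSE assigns to the Cassandra-only nodes):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// the contact points only bootstrap the connection; every other node,
// including ones added later, is discovered automatically
Cluster cluster = Cluster.builder()
        .addContactPoints("10.0.0.1", "10.0.0.2")
        .withLoadBalancingPolicy(
                new TokenAwarePolicy(new DCAwareRoundRobinPolicy("Cassandra")))
        .build();
Session session = cluster.connect("mykeyspace");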

Retrieve ring tokens via thrift or CQL api

Is it possible to retrieve token-to-node assignment information (aka the ring state) via the Thrift or CQL API? I am looking for output similar to what the nodetool ring command returns. I need it to optimize a client application a bit so that it goes directly to the node that contains the requested data, thereby saving one network hop.
The Thrift interface has the method describe_ring that gives you back this information.
In CQL this information is in the system.peers table:
select * from system.peers;
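If your client uses the Java driver, the same ring information is also exposed through the cluster metadata, so you don't have to parse system.peers yourself; a minimal sketch (driver 2.1+ assumed, keyspace name hypothetical):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.TokenRange;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
cluster.init();
Metadata metadata = cluster.getMetadata();
for (TokenRange range : metadata.getTokenRanges()) {
    // getReplicas maps a token range to the nodes that own it for a keyspace
    System.out.println(range + " -> " + metadata.getReplicas("mykeyspace", range));
}
cluster.close();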
