Persistent storage in JanusGraph using Cassandra

I'm playing with JanusGraph and Cassandra backend but I have some doubts.
I have a Cassandra server running on my machine (using Docker) and in my API I have this code:
GraphTraversalSource g = JanusGraphFactory.build()
        .set("storage.backend", "cql")
        .set("storage.hostname", "localhost")
        .open()
        .traversal();
Then, through my API, I'm saving and fetching data using Gremlin. It works fine, and I see data saved in Cassandra database.
The problem comes when I restart my API and try to fetch data. Data is still stored in Cassandra but JanusGraph query returns empty. Why?
Do I need to load backend storage data into memory or something like that? I'm trying to understand how it works.
EDIT
This is how I add an item:
Vertex vertex = g.addV("User")
.property("username", username)
.property("email", email)
.next();
And to fetch all:
List<Vertex> all = g.V().toList();

Commit your Transactions
You are currently using JanusGraph embedded as a library in your application, which gives you access to the full JanusGraph API. It also means that you have to manage transactions on your own, including committing them in order to persist your modifications to the graph.
You can simply do this by calling:
g.tx().commit();
after you have iterated your traversal with the modifications (the addV() traversal in your case).
Without the commit, the changes are only visible in your currently open transaction. Once you restart your application, those uncommitted changes are lost.
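Applied to the code from the question, a minimal sketch of the complete write path could look like this (username and email are the variables from the question):
Vertex vertex = g.addV("User")
        .property("username", username)
        .property("email", email)
        .next();           // iterating the traversal creates the vertex in the open transaction
g.tx().commit();           // persists the vertex to the Cassandra backend

// After a restart of the application, the data is visible again:
List<Vertex> all = g.V().toList();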
The Recommended Approach: Connecting via Remote
If you don't have a good reason to embed JanusGraph as a library in your JVM application, then it's recommended to deploy it independently as JanusGraph Server to which you can send your traversals for execution.
This has the benefit that you can scale JanusGraph independently of your application and also that you can use it from non-JVM languages.
JanusGraph Server then also manages transactions for you transparently by executing each traversal in its own transaction. If the traversal succeeds, the results are committed; if an exception occurs, the transaction is rolled back automatically.
The JanusGraph docs contain a section about how to connect to JanusGraph Server from Java, but the important part is this code to create a graph traversal source g connected to your JanusGraph Server(s):
Graph graph = EmptyGraph.instance();
GraphTraversalSource g = graph.traversal().withRemote("conf/remote-graph.properties");
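If you prefer to configure the connection programmatically instead of via a properties file, a minimal sketch could look like this (the host localhost and port 8182, the default of JanusGraph Server, are assumptions about your setup):
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

// Connect to a JanusGraph Server that exposes a traversal source named "g"
Cluster cluster = Cluster.build("localhost").port(8182).create();
GraphTraversalSource g = traversal().withRemote(DriverRemoteConnection.using(cluster, "g"));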
You can start JanusGraph Server of course also as a Docker container:
docker run --rm janusgraph/janusgraph:latest
More information about the JanusGraph Docker image and how it can be configured to connect to your Cassandra backend can be found here.
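As a rough sketch based on that image's documentation, pointing the server at an existing Cassandra instance could look like this (the environment variable names and the hostname cassandra-host are assumptions, so verify them against the image docs):
docker run --rm -p 8182:8182 \
  -e JANUS_PROPS_TEMPLATE=cql \
  -e janusgraph.storage.hostname=cassandra-host \
  janusgraph/janusgraph:latest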
The part below is no longer directly relevant for this question, given the comments on the first version of this answer. I am leaving it here in case others have a similar problem where this is actually the cause.
Persistent Storage with Docker Containers
JanusGraph stores the data in your storage backend which is Cassandra in your case. That means that you have to ensure that Cassandra persists the data. If you start Cassandra in a Docker container, then you have to mount a volume where Cassandra stores the data to persist it beyond restarts of the container.
Otherwise, the data will be lost once you stop the Cassandra container.
To do this, you can start the Cassandra container for example like this:
docker run -v /my/own/datadir:/var/lib/cassandra -d cassandra
where /my/own/datadir is the directory of your host system where you want the Cassandra data to be stored.
This is explained in the docs of the official Cassandra Docker image under Caveats > Where to Store Data.
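If you run Cassandra via Docker Compose instead, the same idea could look roughly like this (a sketch; the service and volume names are arbitrary):
# docker-compose.yml (sketch): keep Cassandra data on a named volume
services:
  cassandra:
    image: cassandra
    volumes:
      - cassandra-data:/var/lib/cassandra
volumes:
  cassandra-data: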

Related

Which node will respond to "SELECT * FROM system.local" using the Cassandra Java driver?

I am trying to write some synchronization code for a Java app that runs on each of the Cassandra servers in our cluster (so each server has 1 Cassandra instance + our app). For this I wanted to make a method that returns the 'local' Cassandra node, using the Java driver.
Every process creates a CqlSession using the local address as the contact point. The driver will figure out the rest of the cluster from that. But my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This does not seem to be the case when running the code.
Is there a way in the Java driver to determine which of the nodes the process is running on?
I tried this code:
public static Node getLocalNode(CqlSession cqlSession) {
    // All nodes known to the driver, keyed by host id
    Metadata metadata = cqlSession.getMetadata();
    Map<UUID, Node> allNodes = metadata.getNodes();

    // system.local returns the host id of whichever node coordinated this query
    Row row = cqlSession.execute("SELECT host_id FROM system.local").one();
    UUID localUUID = row.getUuid("host_id");

    Node localNode = null;
    for (Node node : allNodes.values()) {
        if (node.getHostId().equals(localUUID)) {
            localNode = node;
            break;
        }
    }
    return localNode;
}
But it seems to return random nodes - which makes sense if it just sends the query to one of the nodes in the cluster. I was hoping to find a way without providing hardcoded configuration to determine what node the app is running on.
my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This does not seem to be the case when running the code.
Correct. When running a query where token range ownership cannot be determined, a coordinator is "selected." There is a random component to that selection. But it does take things like network distance and resource utilization into account.
I'm going to advise reading the driver documentation on Load Balancing. This does a great job of explaining how the load balancing policies work with the newer drivers (>= 4.10).
In that doc you will find that query routing plans:
- are different for each query, in order to balance the load across the cluster;
- only contain nodes that are known to be able to process queries, i.e. neither ignored nor down;
- favor local nodes over remote ones.
As far as being able to tell which apps are connected to which nodes, try using the execution information returned by the result set. You should be able to get the coordinator's endpoint and hostId that way.
ResultSet rs = session.execute("select host_id from system.local");
Row row = rs.one();
System.out.println(row.getUuid("host_id"));
System.out.println();
System.out.println(rs.getExecutionInfo().getCoordinator());
Output:
9788de64-08ee-4ab6-86a6-fdf387a9e4a2
Node(endPoint=/127.0.0.1:9042, hostId=9788de64-08ee-4ab6-86a6-fdf387a9e4a2, hashCode=2625653a)
You are correct. The Java driver connects to random nodes by design.
The Cassandra drivers (including the Java driver) are configured with a load-balancing policy (LBP) which determines which nodes the driver contacts, and in which order, when it runs a query against the cluster.
In your case, you didn't configure a load-balancing policy so it defaults to the DefaultLoadBalancingPolicy. The default policy calculates a query plan (list of nodes to contact) for every single query so each plan is different across queries.
The default policy builds a list of available nodes (down or unresponsive nodes are excluded from the query plan) and "prioritises" replicas in the local DC (the nodes which own the data) over non-replicas, meaning replicas will be contacted as coordinators before other nodes. If there are 2 or more replicas available, they are ordered "healthiest" first. The list in the query plan is also shuffled for randomness so the driver avoids contacting the same node(s) all the time.
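For reference, a minimal sketch of how a 4.x session is typically built so that the default policy knows its local DC (the contact point and the datacenter name "datacenter1" are assumptions about your setup):
import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;

CqlSession session = CqlSession.builder()
        .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
        .withLocalDatacenter("datacenter1") // required by the default policy when contact points are set
        .build();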
Hopefully this clarifies why your app doesn't always hit the "local" node. For more details on how it works, see Load balancing with the Java driver.
I gather from your post that you want to circumvent the built-in load-balancing behaviour of the driver. It seems like you have an edge case that I haven't come across, and I'm not sure what outcome you're after. If you tell us what problem you are trying to solve, we might be able to provide a better answer. Cheers!

How to change property key in gremlin with graph engine as Janusgraph and storage backend as cassandra?

I am using Cassandra 4.0 with JanusGraph 6.0. I have n nodes with the label "januslabel" and the property "janusproperty", and I want to rename the property to "myproperty". I have tried the answer from this link: Rename property with Gremlin in Azure Cosmos DB,
but the change was not permanent: whenever I restart Cassandra or JanusGraph, I get the old property name "janusproperty" again.
How can I change this permanently?
When using JanusGraph, if no transaction is currently open, one is automatically started once a Gremlin query is issued, and subsequent queries become part of that transaction. Transactions need to be committed explicitly, using something like graph.tx().commit(), for any changes to be persisted. Transactions that are not committed will eventually time out and the changes will be lost.
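To illustrate the commit requirement, here is a minimal sketch of the rename, assuming g is a GraphTraversalSource obtained from the embedded JanusGraph graph (the same copy-then-drop idea as the linked answer, expressed with plain Java iteration):
// Copy the value into the new key and remove the old key for every matching vertex
g.V().hasLabel("januslabel").has("janusproperty").forEachRemaining(v -> {
    v.property("myproperty", v.value("janusproperty"));
    v.property("janusproperty").remove();
});
g.tx().commit(); // without the commit, the rename is rolled back once the transaction times out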

How to query Cassandra from a certain node and get only the data that node owns?

Cassandra uses consistent hashing to manage data, and when we use a Cassandra driver to connect to the cluster, the node we connect to may query other nodes in the cluster to get the result. But in my current situation I'm doing some testing for my algorithm: I want to specify a certain token range and query the data in that token range on a certain node only. If some data in the token range isn't on this node, I don't want the node to query other nodes to get the result. Is this possible, and how can I achieve it?
I found Cassandra Python driver: force using a single node, but that solution only makes the client's connection pool connect to a certain node; the node will still query other nodes.
Use the WhiteListRoundRobinPolicy and CL.ONE as linked in the other question.
You can also extend Statement to include a host, together with a custom load balancing policy that sends the request to the host carried by the wrapper. Extend a policy and override make_query_plan, something like this (untested sketch, treat it as pseudocode):
class StatementSingleHostRouting(DCAwareRoundRobinPolicy):
    def make_query_plan(self, working_keyspace=None, query=None):
        # If the statement carries a target host, route only to that host
        if query is not None and getattr(query, "host", None):
            return [query.host]
        # Otherwise fall back to the normal DC-aware plan
        return DCAwareRoundRobinPolicy.make_query_plan(self, working_keyspace, query)
If that host doesn't own the data it will still query other replicas though.

Which couchbase node will serve request?

I have a NodeJS service which talks to a Couchbase cluster to fetch data. The Couchbase cluster has 4 nodes (running on ip1, ip2, ip3, ip4) and the service also runs on the same 4 servers. On all the NodeJS services my connection string looks like this:
couchbase://ip1,ip2,ip3,ip4
but whenever I try to fetch a document from bucket X, the console shows that the node on ip4 is doing that operation. No matter which NodeJS application makes the request, the same ip4 serves all of them.
I want each NodeJS server to use its own Couchbase node so that RAM and CPU consumption is equal across all the servers, so I changed the order of the IPs in the connection string, but every time the request is served by the same ip4.
I created another bucket, put my data in it, and tried to fetch it, but again it went to the same ip4. Can someone explain why this is happening, and can it cause high load on one of the nodes?
What do you mean by "I want each NodeJS server to use their couchbase node"?
In Couchbase, part of the active dataset is on each node in the cluster. The sharding is automatic. When you have a cluster, the 1024 active vBuckets (shards) for each Bucket are spread out across all the nodes of the cluster. So with your 4 nodes, there will be 256 vBuckets on each node. Given the consistent hashing algorithm used by the Couchbase SDK, it will be able to tell from the key which vBucket that object goes into and combined with the cluster map it got from the cluster, know which node that vBucket lives in the cluster. So an app will be getting data from each of the nodes in the cluster if you have it configured correctly as the data is evenly spread out.
On the file system, as part of the Couchbase install, there is a CLI tool called vbuckettool that takes an object ID and a cluster map as arguments. All it does is run the consistent hashing algorithm against the cluster map, so you can actually predict where an object will go even if it does not exist yet.
On a different note, best practice in production is to not run your application on the same nodes as Couchbase. It really is supposed to be separate to get the most out of its shared nothing architecture among other reasons.

Which Cassandra node should I connect to?

I might be misunderstanding something here, as it's not clear to me how I should connect to a Cassandra cluster. I have a Cassandra 1.2.1 cluster of 5 nodes managed by Priam, on AWS. I would like to use Astyanax to connect to this cluster using code similar to the code below:
conPool = new ConnectionPoolConfigurationImpl(getConecPoolName())
        .setMaxConnsPerHost(CONNECTION_POOL_SIZE_PER_HOST)
        .setSeeds(MY_IP_SEEDS)
        .setMaxOperationsPerConnection(100); // 10000
What should I use as MY_IP_SEEDS? Should I use the IPs of all my nodes, separated by commas? Or should I use the IP of just one machine (the seed machine)? If I use the IP of just one machine, I am worried about overloading it with too many requests.
I know Priam has the "get_seeds" REST API (https://github.com/Netflix/Priam/wiki/REST-API) that for each node returns a list of IPs, and I also know there is one seed per RAC. However, I am not sure what would happen if the seed node goes down... I would need to connect to others when trying to make new connections, right?
Seed nodes are only for finding the way into the cluster on node startup - no overload problems.
Of course one of the nodes must be reachable and up in the cluster to get the new one up and running.
So the best way is to update the seed list from Priam before starting the node. Priam should be behind an automatically updated DNS entry.
If you're after the highest availability, you should regularly fetch the current list of seeds from Priam and store it in a mirrored fashion, just as you store your Puppet or Chef config, to be able to get nodes up even when Priam isn't reachable.
