In the description of SmartGraphs here it seems to imply that graph traversal queries actually follow edges from machine to machine until the query finishes executing. Is that how it actually works? For example, suppose that you have the following query that retrieves 1-hop, 2-hop, and 3-hop friends starting from the person with id 12345:
FOR p IN Person
  FILTER p._key == "12345"
  FOR friend IN 1..3 OUTBOUND p knows
    RETURN friend
Can someone please walk me through the lifetime of this query starting from the client and ending with the results on the client?
What actually happens can be a bit different from the schemas on our website. What we show there is kind of a "worst case" where the data cannot be sharded perfectly (just to make it a bit more fun). But let's take a quick step back first to describe the different roles within an ArangoDB cluster. If you are already familiar with our cluster lingo/architecture, please skip the next paragraph.
You have the Coordinator which, as the name says, coordinates query execution and is also the place where the final result set is built before it is sent back to the client. Coordinators are stateless, host a query engine, and are the place where Foxx services live. The actual data is stored on the DBservers in a stateful fashion, but DBservers also have a distributed query engine, which plays a vital role in all our distributed query processing. The brain of the cluster is the Agency, with at least three agents running the Raft consensus protocol.
If you have sharded your graph data set as a SmartGraph, the following happens when a query is sent to a Coordinator.
- The Coordinator knows which data needed for the query resides on which machine and distributes the query accordingly to the respective DBservers.
- Each DBserver has its own query engine, processes the incoming query from the Coordinator locally, and then sends its intermediate result back to the Coordinator, where the final result set gets put together. This runs in parallel.
- The Coordinator then sends the result back to the client.
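The three steps above can be sketched as a toy scatter/gather in Python. This is only an illustration of the pattern, not ArangoDB code: the shard layout, the server names and the helper functions are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard layout: DBserver name -> the Person documents it stores.
SHARDS = {
    "dbserver1": {"12345": "Alice", "20001": "Bob"},
    "dbserver2": {"30001": "Carol"},
}

def run_on_dbserver(server, keys):
    """Stand-in for a DBserver's local query engine: look up the keys it owns."""
    local = SHARDS[server]
    return [local[k] for k in keys if k in local]

def coordinator_query(keys):
    """Scatter the lookup to the DBservers that own data, then gather the results."""
    # Step 1: work out which server holds which of the requested keys.
    plan = {s: [k for k in keys if k in docs] for s, docs in SHARDS.items()}
    plan = {s: ks for s, ks in plan.items() if ks}
    # Step 2: run the partial queries in parallel, one per DBserver.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda item: run_on_dbserver(*item), plan.items()))
    # Step 3: merge the intermediate results into the final result set.
    return sorted(name for part in partials for name in part)
```

Calling `coordinator_query(["12345", "30001"])` touches both toy servers, while a query whose keys all live on one server is sent to that server only.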
If you have a perfectly shardable graph (e.g. a hierarchy whose branches are the shards; use cases could be Bill of Materials or Network Analytics), you can achieve performance close to that of a single instance, because queries can be sent to the right DBservers and no network hops are required.
If you have a much more "unstructured" graph like a social network, where connections can occur between any two given vertices, sharding becomes an optimization question and, depending on the query, it is more likely that network hops between servers occur. This latter case is shown in the schemas on our website. In this case, the SmartGraph feature can reduce the number of network hops needed, but it cannot eliminate them completely.
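To make the difference between the two cases concrete, here is a rough Python sketch that walks a tiny graph breadth-first and counts how many followed edges cross from one shard to another. The graph, the shard assignments and the function name are assumptions for illustration; the real traversal engine is far more involved.

```python
from collections import deque

def count_network_hops(edges, shard_of, start, max_depth):
    """BFS up to max_depth; count followed edges whose endpoints sit on different shards."""
    seen = {start}
    queue = deque([(start, 0)])
    hops = 0
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in edges.get(vertex, []):
            if shard_of[neighbor] != shard_of[vertex]:
                hops += 1  # this edge leaves the current server
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return hops

# The same toy graph, sharded two different ways.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
hierarchy = {"a": 0, "b": 0, "c": 0, "d": 0}   # whole branch on one shard
social = {"a": 0, "b": 1, "c": 0, "d": 1}      # friends spread over two shards
```

With the `hierarchy` layout a 1..3 traversal from `a` never leaves its shard, while the `social` layout forces some edges to cross servers.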
Hope this helped a bit.
Related
I am trying to write some synchronization code for a Java app that runs on each of the Cassandra servers in our cluster (so each server has 1 Cassandra instance + our app). For this I wanted to make a method that will return the 'local' Cassandra node, using the Java driver.
Every process creates a CqlSession using the local address as contact point. The driver will figure out the rest of the cluster from that. But my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. That does not seem to be the case when I run the code.
Is there a way in the Java driver to determine which of the x nodes the process is running on?
I tried this code:
public static Node getLocalNode(CqlSession cqlSession) {
    Metadata metadata = cqlSession.getMetadata();
    Map<UUID, Node> allNodes = metadata.getNodes();
    Row row = cqlSession.execute("SELECT host_id FROM system.local").one();
    UUID localUUID = row.getUuid("host_id");
    Node localNode = null;
    for (Node node : allNodes.values()) {
        if (node.getHostId().equals(localUUID)) {
            localNode = node;
            break;
        }
    }
    return localNode;
}
But it seems to return random nodes - which makes sense if it just sends the query to one of the nodes in the cluster. I was hoping to find a way without providing hardcoded configuration to determine what node the app is running on.
my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This seems not so, when trying to run the code.
Correct. When running a query where token range ownership cannot be determined, a coordinator is "selected." There is a random component to that selection. But it does take things like network distance and resource utilization into account.
I'm going to advise reading the driver documentation on Load Balancing. This does a great job of explaining how the load balancing policies work with the newer drivers (>= 4.10).
In that doc you will find that query routing plans:
are different for each query, in order to balance the load across the cluster;
only contain nodes that are known to be able to process queries, i.e. neither ignored nor down;
favor local nodes over remote ones.
As far as being able to tell which apps are connected to which nodes, try using the execution information returned by the result set. You should be able to get the coordinator's endpoint and hostId that way.
ResultSet rs = session.execute("select host_id from system.local");
Row row = rs.one();
System.out.println(row.getUuid("host_id"));
System.out.println();
System.out.println(rs.getExecutionInfo().getCoordinator());
Output:
9788de64-08ee-4ab6-86a6-fdf387a9e4a2
Node(endPoint=/127.0.0.1:9042, hostId=9788de64-08ee-4ab6-86a6-fdf387a9e4a2, hashCode=2625653a)
You are correct. The Java driver connects to random nodes by design.
The Cassandra drivers (including the Java driver) are configured with a load-balancing policy (LBP) which determines which nodes the driver contacts, and in which order, when it runs a query against the cluster.
In your case, you didn't configure a load-balancing policy so it defaults to the DefaultLoadBalancingPolicy. The default policy calculates a query plan (list of nodes to contact) for every single query so each plan is different across queries.
The default policy builds a list of available nodes (down or unresponsive nodes are not included in the query plan) that "prioritises" query replicas (replicas which own the data) in the local DC over non-replicas, meaning replicas will be contacted as coordinators before other nodes. If there are 2 or more replicas available, they are ordered "healthiest" first. Also, the list in the query plan is shuffled for randomness so the driver avoids contacting the same node(s) all the time.
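As a rough mental model of that description, here is a toy query-plan builder in Python. It is not the driver's implementation: the node dictionaries and the function name are invented, it only considers live nodes in the local DC, and it leaves out the "healthiest first" ordering.

```python
import random

def build_query_plan(nodes, replicas, local_dc, rng=random):
    """Return live local-DC nodes, replicas first, with each group shuffled."""
    live_local = [n for n in nodes if n["up"] and n["dc"] == local_dc]
    first = [n["name"] for n in live_local if n["name"] in replicas]
    rest = [n["name"] for n in live_local if n["name"] not in replicas]
    rng.shuffle(first)   # randomise among replicas
    rng.shuffle(rest)    # randomise among the remaining nodes
    return first + rest

# Hypothetical four-node cluster; n3 is down and n4 is in another DC.
nodes = [
    {"name": "n1", "dc": "dc1", "up": True},
    {"name": "n2", "dc": "dc1", "up": True},
    {"name": "n3", "dc": "dc1", "up": False},
    {"name": "n4", "dc": "dc2", "up": True},
]
plan = build_query_plan(nodes, replicas={"n2"}, local_dc="dc1", rng=random.Random(0))
```

A query like `SELECT host_id FROM system.local` has no partition key, so in this model `replicas` would be empty and the first node in the plan is effectively random, which matches what the question observed.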
Hopefully this clarifies why your app doesn't always hit the "local" node. For more details on how it works, see Load balancing with the Java driver.
I gather from your post that you want to circumvent the built-in load-balancing behaviour of the driver. It seems like you have an edge case that I haven't come across, and I'm not sure what outcome you're after. If you tell us what problem you are trying to solve, we might be able to provide a better answer. Cheers!
Scenario
I need to design a system which should be able to read from a local database and perform writes to all its replicas (currently using Azure Tables).
I would appreciate it if someone could share their approach to achieve this.
Existing design
Region 1: A node with "computeScore" service running. And a database of all customer data.
Region 2: A node with "computeScore" service running.
ComputeScore Service is called at every successful login of a user. It reads the previous login information of the same user from the database in Region 1 and computes a score. This score is written again to the database in Region 1.
Issue
Whenever a customer request is routed to Region 2 (based on the current location of the users, the requests are routed to their nearest physical server), it makes an extra call to Region 1 database to perform the database operations, which obviously results in extra time when compared to customers hitting Region 1.
One way to do it is to maintain a copy of the database in both regions, but the challenge here would be consistency (writing back the computed score to both copies), and I am not sure how to achieve this.
Looking for a solution to avoid the extra latency for customers directed to Region 2.
I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway in Lambda and it simply queries against a Postgres RDS to check for the product ID. If it finds the product, it returns the information about the product in the response. If it doesn't, it just returns an empty response. The SQL is basically this:
SELECT * FROM inventory WHERE expired = false AND product_id = :productId;  -- :productId is bound to the product ID from the request
However, the problem is that many services are calling this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the burst.
The number of concurrent calls on the API results in many queries to the database. The rate can burst beyond 30 queries per second, and there can be a few hundred thousand requests to fulfil. The queries are mostly the same, except for the product ID in the WHERE clause. The column is indexed and a query takes only 5-8 ms on average to complete. Still, connections to the database occasionally time out when the rate gets too high.
I'm using Sequelize as my ORM, and the error I get when it times out is SequelizeConnectionAcquireTimeoutError. There is a good chance that the burst rate was too high and it maxed out the pool too.
Some options I have considered:
- Using a cache layer. But I have noticed that, most of the time, 90% of the product IDs in the requests are not repeated. This would mean that 90% of the time, it would be a cache miss and it will still query against the database.
- Auto scale up the database. But because the calls are bursty and I don't know when they may come, the autoscaling won't complete in time to avoid the timeout. Moreover, the query is a very simple select statement and the CPU of the RDS instance hardly crosses 80% during the bursts. So I doubt scaling it would do much either.
What other techniques can I use to keep the database from being hit hard when the API gets bursts of calls which are mostly unique and difficult to cache?
Load the cache at boot time
You can load all the necessary columns into an in-memory data store (Redis). Every update in the database then also has to update the cached data (e.g. through a cron job).
Problems: memory overhead, and the cost of keeping the cache up to date
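A minimal sketch of the boot-time idea, assuming a plain Python dict as a stand-in for Redis and an in-memory SQLite table as a stand-in for the Postgres inventory table. The `name` column and both function names are made up for the example.

```python
import sqlite3

def load_product_cache(conn):
    """Read every non-expired product once and index it by product_id."""
    rows = conn.execute(
        "SELECT product_id, name FROM inventory WHERE expired = 0"
    ).fetchall()
    return {pid: {"product_id": pid, "name": name} for pid, name in rows}

def lookup(cache, product_id):
    """Answer an existence check from memory; an empty dict means 'not found'."""
    return cache.get(product_id, {})

# Stand-in database with one live and one expired product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id INTEGER, name TEXT, expired INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)", [(1, "widget", 0), (2, "gadget", 1)])
cache = load_product_cache(conn)
```

Because the whole table is preloaded, a product ID that has never been requested before is still answered from memory, which is the point of this option compared with a cache filled on demand.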
Limit DB calls
Create a buffer for IDs. Store n IDs and then make one query for all of them. Or empty the buffer every m seconds!
Problems: longer client response time, and extra processing to map the query result back to each request
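Here is a small sketch of the buffering idea, again with in-memory SQLite standing in for Postgres. The `IdBuffer` class and its method names are hypothetical, and a timer that flushes every m seconds is left out to keep it short.

```python
import sqlite3

class IdBuffer:
    """Collect product IDs and resolve them with one IN (...) query per batch."""

    def __init__(self, conn, batch_size):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []
        self.queries_run = 0

    def add(self, product_id):
        """Queue an ID; return the batch result once the buffer is full, else None."""
        self.pending.append(product_id)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return None

    def flush(self):
        """Run one query for all queued IDs and map each ID to found / not found."""
        if not self.pending:
            return {}
        ids, self.pending = self.pending, []
        placeholders = ",".join("?" * len(ids))
        rows = self.conn.execute(
            f"SELECT product_id FROM inventory WHERE expired = 0 AND product_id IN ({placeholders})",
            ids,
        ).fetchall()
        self.queries_run += 1
        found = {row[0] for row in rows}
        return {pid: pid in found for pid in ids}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id INTEGER, expired INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)", [(1, 0), (2, 1)])
buffer = IdBuffer(conn, batch_size=3)
results = [buffer.add(pid) for pid in (1, 2, 3)]
```

Three lookups cost one database round trip here; the price is that the first two callers have to wait until the batch is full, which is the response-time problem mentioned above.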
Change your database
Use a NoSQL database for this data. According to this article and this one, I think choosing a NoSQL database is a better idea.
Problems: multiple data stores
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns in your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query completely from the index. It's faster.
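You can see the effect of a covering index without a Postgres instance. The sketch below uses Python's built-in SQLite as a stand-in; SQLite has no INCLUDE clause, so the extra column is put into the index key instead, and the `name` and `qty` columns are invented for the example. On Postgres you would check for an "Index Only Scan" in the EXPLAIN output instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id INTEGER, expired INTEGER, name TEXT, qty INTEGER)")
# SQLite has no INCLUDE clause, so the extra column goes into the key itself.
conn.execute("CREATE INDEX inv_lkup ON inventory (product_id, expired, name)")

def plan_for(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)  # column 3 holds the plan text

# Every selected column is in the index: the table itself is never read.
covered = plan_for(
    "SELECT product_id, expired, name FROM inventory WHERE product_id = ? AND expired = 0", (42,)
)
# SELECT * also needs qty, which is not in the index, so the table must be visited.
not_covered = plan_for(
    "SELECT * FROM inventory WHERE product_id = ? AND expired = 0", (42,)
)
```

This is also why `SELECT *` works against you here: listing only the columns the API actually returns is what lets the index cover the query.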
You could start using AWS Lambda throttling to handle this problem. But for that to work, the consumers of your API will need to retry when they get 429 responses. That might be super-inconvenient.
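On the consumer side, retrying on 429 usually means some form of backoff. A minimal sketch, assuming a hypothetical `send` callable that returns a `(status, body)` pair; the function name, the delays and the fake responses are all illustrative.

```python
import time

def call_with_retry(send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send() until it returns a non-429 status or the attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 0.1s, 0.2s, 0.4s, ...
    return status, body  # still throttled after the last attempt

# Fake API that throttles twice before answering; delays are recorded, not slept.
fake_responses = iter([(429, None), (429, None), (200, {"product_id": 1})])
delays = []
result = call_with_retry(lambda: next(fake_responses), sleep=delays.append)
```

In practice you would probably also add some random jitter to the delay so that all throttled callers do not retry at the same moment.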
Sorry to say, you may need to stop using Lambda. Ordinary web servers have good stuff in them to manage burst workloads.
They have an incoming connection (TCP/IP listen) queue. Each new request coming in lands in that queue, where it waits until the server software accepts the connection. When the server is busy, requests wait in that queue. When there's a high load, the requests wait a bit longer in that queue. In Node.js's case, if you use clustering there's just one of these incoming connection queues, and all the processes in the cluster use it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections in it. As your server software handles each request, it awaits a connection from the pool. If no connection is immediately available, the request handling pauses until one is available and then continues. This too smooths out the requests to the DBMS. (Be aware that each process in a Node.js cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
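The smoothing effect of a small pool can be sketched with a semaphore. This is a toy model, not Sequelize's pool: `TinyPool`, `burst` and the 10 ms fake query are assumptions made for the example.

```python
import threading
import time

class TinyPool:
    """Hand out at most `size` connections at once; extra callers wait their turn."""

    def __init__(self, size):
        self._slots = threading.Semaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak = 0

    def run(self, query):
        with self._slots:              # blocks while all connections are busy
            with self._lock:
                self.in_use += 1
                self.peak = max(self.peak, self.in_use)
            try:
                return query()
            finally:
                with self._lock:
                    self.in_use -= 1

def burst(pool, n_requests):
    """Fire n_requests at once, wait for all of them, return the peak concurrency."""
    fake_query = lambda: time.sleep(0.01)   # stand-in for a 10 ms SELECT
    threads = [threading.Thread(target=pool.run, args=(fake_query,)) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pool.peak
```

A burst of 30 requests against `TinyPool(3)` never puts more than three queries on the database at once; the rest simply queue inside the server process instead of timing out on the DBMS side.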
This kind of server configuration can be scaled out: a load balancer will do. So will a server with more cores and more Node.js cluster processes. An elastic load balancer can also add new server VMs when necessary.
If we configure our replication factor in such a way that there are no replica nodes (data is stored in one place/node only), and the node that contains the requested data is down, how will the request be handled by Cassandra?
Will it return no data, or will the other nodes gossip and somehow pick up the data from the failed node (storage) and send the required response? If the data is picked up, will the data transfer between nodes happen as soon as the node goes down (gossip protocol) or only after a request is made?
I have researched for a long time how gossip works and how Cassandra achieves high availability, but I was wondering about the availability of data in the case of "no replicas", since I do not want to waste additional storage on occasional failures and, at the same time, I need availability and no data loss (even if delayed).
I assume that by "no replica nodes" you mean that you have set the replication factor to 1 (RF=1). In this case, if the request is a read, it will fail. If the request is a write, it can be stored as a hint, up to the maximum hint window, and replayed when the node comes back. Note that this only lets the write succeed at consistency level ANY; at ONE or higher the coordinator rejects the write as unavailable because the only replica is down. If the node is down for longer than the hint window, that write will be lost. See Hinted Handoff: repair during write path.
In general, having only a single replica of your data in a C* cluster goes against some of the basic design of how C* is meant to be used and is an anti-pattern. Data duplication is a normal and expected part of using C* and is what allows for its high availability. Having RF=1 introduces a single point of failure into the system, as the server containing that data can go down for any of a variety of reasons (including things like maintenance), which will cause requests to fail.
If you are truly looking for a solution that provides high availability and no data loss, then you need to increase your replication factor (the standard I usually see is RF=3) and set up your cluster's hardware in such a manner as to reduce/remove potential single points of failure.
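To see why RF=1 is a single point of failure, here is a toy token ring in Python. It is a simplified model that places replicas on the next nodes clockwise around the ring; the node names, the tokens and both helper functions are made up for illustration.

```python
def replicas_for(token, ring, rf):
    """Pick the rf nodes that follow the token clockwise around the ring."""
    nodes = sorted(ring, key=ring.get)            # node names ordered by their token
    start = next((i for i, n in enumerate(nodes) if ring[n] >= token), 0)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

def can_read(token, ring, rf, down):
    """A read at consistency level ONE needs at least one live replica."""
    return any(node not in down for node in replicas_for(token, ring, rf))

# Hypothetical four-node ring: node name -> token it owns.
ring = {"n1": 0, "n2": 25, "n3": 50, "n4": 75}
```

With RF=1 the row at token 30 lives only on `n3`, so taking `n3` down makes it unreadable; with RF=3 the same row is also on `n4` and `n1`, and the read still succeeds.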
My team is testing the token aware connection pool of Astyanax. How can we measure effectiveness of the connection pool type, i.e. how can we know how the tokens are distributed in a ring and how client connections are distributed across them?
Our initial tests, which count the number of open connections on the network cards, show that only 3 out of 4 or more Cassandra instances in a ring are used, and the other nodes participate in request processing only to a very limited extent.
What other information would help us make a valid judgment/verification? Is there a Cassandra/Astyanax API or command line tool to help us out?
Use OpsCenter. This will show you how balanced your cluster is, i.e. whether each node has the same amount of data, as well as being able to graph the incoming read/write requests per node and for your entire cluster. It is free and works with open source Cassandra as well as DSE. http://www.datastax.com/what-we-offer/products-services/datastax-opscenter