Recently we moved our product from ArangoDB single node to ArangoDB cluster mode (version 3.3.9).
As a result we are experiencing much longer query times.
The following query returns data on the single-node ArangoDB within a few milliseconds:
WITH Node_Collection_1
FOR c IN Node_Collection_1
    LET neighbours = (
        FOR v, e, p IN INBOUND c._id Edge_Collection_1
            RETURN v._key
    )
    RETURN {
        content: c._key,
        neighbours: neighbours
    }
On the clustered ArangoDB the same query on the same data takes several seconds (~15 seconds).
In our use case we have many small graphs (1 node collection and 1 edge collection per graph) and therefore we configured each collection to have only one shard since each graph won't be too large.
The cluster configuration is:
3 data nodes
3 agent nodes
1 coordinator node
All of them are distributed among 3 AWS machines on top of Kubernetes.
Assuming that both Node_Collection_1 and Edge_Collection_1 happen to reside on the same data node, why is there such a drastic latency degradation compared to the single-node setup? And what can be done to improve this?
Related
Consider a growing amount of data, and let's choose between two extreme options:
Evenly distribute all data across all nodes in the cluster
Pack the data onto as few nodes as possible
I prefer option 1 because, as the volume of data grows, we can spread it across all nodes, so that each node carries the lowest possible load when queried.
However, some resources state that we shouldn't query all the nodes because that will slow down the query. Why would that slow the query? Isn't that just a normal scatter and gather? They even claim this hurts linear scalability as adding more nodes will further drag down the query.
(Maybe I am missing something about how Cassandra performs the query; some background reference would be appreciated.)
On the contrary, some resources state that we should go with option 2 because it queries the least number of nodes.
Of course there are no black-and-white choices here; everything has a tradeoff.
I want to know what the real difference is between option 1 and option 2, and, regarding the network querying, why option 1 would be slow.
I prefer option 1 because, as the volume of data grows, we can spread it across all nodes, so that each node carries the lowest possible load when queried.
You definitely want to go with option #1. This is also preferable in that new or replacement nodes will stream much faster than in a cluster made of fewer, dense nodes.
However, some resources state that we shouldn't query all the nodes because that will slow down the query.
And those resources are absolutely correct. First of all, if you read through the resources which Alex posted above you'll discover how to build your tables so that your queries can be served by a single node. Running queries which only hit a single node is the best way around that problem.
Why would that slow the query?
Because in a distributed database environment, query time becomes network time. There are many people out there who like to run multi-key or unbound queries against Cassandra. When that happens, and the query is unable to find a single node with the data, Cassandra picks one node to designate as a "coordinator."
That node builds the result set with data from the other nodes, which means that in a 30-node cluster, one node is now pulling data from the other 29. Even if these requests don't time out, the likelihood that the coordinator will crash from trying to manage too much data is very high.
The bottom line is that this is one of those tradeoffs between a CA relational database and an AP partitioned row store. Build your tables to support your queries, store data together which is queried together, and Cassandra will perform just fine.
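To make "store data together which is queried together" concrete, here is a minimal sketch with the DataStax Python driver (the keyspace, table, and column names are invented for illustration, not taken from the question):

import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# All orders for a customer live under one partition key, so reading a
# customer's orders touches a single partition instead of fanning out.
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_customer (
        customer_id uuid,
        order_time  timestamp,
        order_id    uuid,
        total       decimal,
        PRIMARY KEY ((customer_id), order_time)
    ) WITH CLUSTERING ORDER BY (order_time DESC)
""")

# Single-partition read: a token-aware driver can route it straight to a replica.
customer_id = uuid.uuid4()
rows = session.execute(
    "SELECT order_id, total FROM orders_by_customer WHERE customer_id = %s",
    (customer_id,),
)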
I have a Scylla cluster with 3 nodes and 1 table created with the query below:
CREATE TABLE id_features (
    id int PRIMARY KEY,
    id_feature_1 int,
    id_feature_2 int
);
I am issuing the below query from the application:
SELECT * FROM id_features where id in (1,2,3,4...120);
The query can have a maximum of 120 ids.
Will this query contact all 3 nodes based on the token values of the ids to fetch data for 120 ids in the worst case?
Or will only 1 node be contacted to fetch the data for all the ids, with multiple nodes used only for high availability?
Do the replication factor, consistency level, and load balancing policy play any role in deciding the node?
Will this Query contact all 3 nodes based on the token value of ids to fetch data
Do the replication factor, consistency level, and load balancing policy play any role in deciding the node?
It very much depends on things like replication factor (RF), query consistency, and load balancing policy. Specifically, if RF < number of nodes, then multiple nodes will be contacted, based on the hashed token value of id and the nodes primarily assigned to those token ranges.
But, given this statement:
Or will only 1 node be contacted to fetch the data for all the ids, with multiple nodes used only for high availability?
...I get the sense that RF=3 in this case.
If the app is configured to use the (default) TokenAwarePolicy then yes, for single-key queries only, requests can be sent to the individual nodes.
But in this case, the query is using the IN operator. With 120 potential entries, the driver cannot determine a single node to send the query to. In that case, the TokenAwarePolicy simply acts as a pass-through for its child policy (DCAwareRoundRobinPolicy), and it will pick a node at LOCAL distance to be the "coordinator." The coordinator node will then take on the additional tasks of routing replica requests and compiling the result set.
As to whether or not non-primary replicas are utilized in query plans, the answer is again "it depends." While the load balancing policies differ in implementation, in general all of them compute query plans which:
are different for each query, in order to balance the load across the cluster;
only contain hosts that are known to be able to process queries, i.e. neither ignored nor down;
favor local hosts over remote ones.
Taken from: https://docs.datastax.com/en/developer/java-driver/3.6/manual/load_balancing/#query-plan
So in a scenario where RF = number of nodes, a single node sometimes may be used to return all requested replicas.
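For reference, a rough sketch of the same policy chain in the DataStax Python driver (the documentation linked above is for the Java driver; the contact point and data center name here are placeholders, and newer Python driver versions configure this through execution profiles instead):

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# Token-aware routing wrapped around DC-aware round robin, mirroring the
# TokenAwarePolicy(DCAwareRoundRobinPolicy) chain discussed above.
cluster = Cluster(
    ["127.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="datacenter1")
    ),
)
session = cluster.connect()

# Single-key statements can be routed directly to a replica of that token;
# a 120-key IN statement has no single routing token, so the child policy
# just picks a coordinator at LOCAL distance.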
Pro-tip:
Try not to use the IN operator with a list of 120 partition key entries. That forces Cassandra to perform random reads, whereas it really excels at sequential reads. If that's a query the app really needs to do, try:
Building a new table to better support that query pattern (a sketch of this option follows the list).
Not exceeding double digits of entries in the IN clause.
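A hedged sketch of what the first option could look like, assuming the ~120 ids that are fetched together share some natural grouping (the group_id column, keyspace, and table names below are invented; if no such grouping exists in the data model, this approach doesn't apply):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Same columns as id_features, but partitioned by the grouping that is
# queried together, so the whole fetch becomes one single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS id_features_by_group (
        group_id     int,
        id           int,
        id_feature_1 int,
        id_feature_2 int,
        PRIMARY KEY ((group_id), id)
    )
""")

# Replaces: SELECT * FROM id_features WHERE id IN (1, 2, ..., 120);
rows = session.execute(
    "SELECT id, id_feature_1, id_feature_2 FROM id_features_by_group WHERE group_id = %s",
    (42,),
)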
I have a three node Cassandra (DSE) cluster where I don't care about data loss so I've set my RF to 1. I was wondering how Cassandra would respond to read/write requests if a node goes down (I have CL=ALL in my requests right now).
Ideally, I'd like these requests to succeed if the data exists - just on the remaining available nodes till I replace the dead node. This keyspace is essentially a really huge cache; I can replace any of the data in the event of a loss.
(Disclaimer: I'm a ScyllaDB employee)
Assuming your partition key was unique enough, with RF=1 each of your 3 nodes contains 1/3 of your data. By the way, in this case CL=ONE/ALL is basically the same, as there's only 1 replica for your data and no High Availability (HA).
Requests for "existing" data from the 2 up nodes will succeed. Still, while one of the 3 nodes is down, roughly 1/3 of your client requests (for existing data) will fail, since basically 1/3 of your data is unavailable until the down node comes back up (note that nodetool repair is irrelevant when using RF=1), so I guess restoring from a snapshot (if you have one available) is the only option.
While the node is down, once you execute nodetool decommission, the token ranges will re-distribute between the 2 up nodes, but that will apply only to new writes and reads.
You can read more about the ring architecture here:
http://docs.scylladb.com/architecture/ringarchitecture/
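For context, a minimal sketch of how the replication factor being discussed is declared (the keyspace name and replication strategy are placeholders, not from the question):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# RF=1: exactly one copy of each partition, so on a 3-node cluster a down node
# makes roughly 1/3 of the data unavailable. RF=3 would give every node a full
# copy and allow reads to keep succeeding while a node is down.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS cache_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")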
We have less than 50GB of data for a table and we are trying to come up with a reasonable design for our Cassandra database. With so little data we are thinking of having all data on each node (2 node cluster with replication factor of 2 to start with).
We want to use Cassandra for easy replication - safeguarding against failover, having copies of data in different parts of the world and Cassandra is brilliant for that.
Moreover, the best model we have currently come up with implies that a single query (consistency level 1-2) would involve getting data from multiple partitions (avg = 2, 90th percentile = 20). Most of the queries would ask for data from <= 2 partitions, but some might go up to 5k.
So my question here is whether it is really a problem? Is Cassandra slow to retrieve data from multiple partitions if we ensure that all the partitions are on the single node?
EDIT:
I misread the question; my apologies to other folks coming here later. Please look at the code for TokenAwarePolicy as a basis for determining replica owners; once you have that, you can combine it with an IN query to get multiple partitions from a single node. Still be mindful of total query size.
Original for reference:
Don't get data from multiple partitions in a single query; the details of why are here.
The TL;DR is that you're better off querying asynchronously from multiple different partitions than requiring the coordinator to do that work.
You have more to retry if the query fails (which is particularly ugly when you have a very large partition or two in that query).
You're waiting on the slowest query for any response to come back, when you could be returning parts of the answer as they come in (or even including a progress meter based on the parts that are done).
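As a hedged illustration of the async approach with the DataStax Python driver (the keyspace, table, and key list are placeholders; the original posts don't specify a driver):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# One prepared single-partition statement...
stmt = session.prepare("SELECT * FROM my_table WHERE pkey = ?")

# ...fired concurrently, one request per partition key. With a token-aware
# policy each request goes straight to a replica, no single coordinator has
# to gather every partition, and one slow partition doesn't hold up the rest.
keys = ["key-{}".format(i) for i in range(120)]
futures = [session.execute_async(stmt, (k,)) for k in keys]
results = [f.result() for f in futures]

The driver also ships cassandra.concurrent.execute_concurrent_with_args, which does roughly the same thing while capping the number of in-flight requests.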
I did some testing on my machine and the results contradict what Ryan Svihla proposed in another answer.
TL;DR: storing the same data in multiple partitions and retrieving it via the IN operator is much slower than storing the data in a single partition and retrieving it in one go. PLEASE NOTE that all of the action happens on a single Cassandra node (the conclusion should be even more obvious for a distributed Cassandra cluster).
Case A
Insert X rows into a single partition of the table defined below. Retrieve all of them via SELECT specifying the partition key in WHERE.
Case B
Insert X rows each into a separate partition of the table defined below. Retrieve all of them via SELECT specifying multiple partition keys using WHERE pKey IN (...).
Table definition
pKey: Text PARTITION KEY
cColumn: Int CLUSTERING KEY
sParam: DateTime STATIC
param: Text (size of each was 500 B in tests)
Results (r = time B / time A)
Using the Phantom driver:
X = 100:   A - 10 ms,  B - 150 ms    (r = 15)
X = 1000:  A - 20 ms,  B - 1400 ms   (r = 70)
X = 10000: A - 100 ms, B - 14000 ms  (r = 140)
Using DevCenter (it has a limit of 1000 rows retrieved in one go):
X = 100:   A - 20 ms,  B - 900 ms    (r = 45)
X = 1000:  A - 30 ms,  B - 1300 ms   (r = 43)
Technical details:
Phantom driver v 2.13.0
Cassandra 3.0.9
Windows 10
DevCenter 1.6
I've started working with Elasticsearch. I don't have any cluster, shards, or replicas; I just have some nodes that are not in a cluster.
First, I want to improve search for my site with Elasticsearch. Now imagine I have 4 nodes; I want to know how many shards I should have on just one node.
I don't want the default 5 shards. My requirements are the following:
qps = 50
document size = 300k
RAM per node = 5 GB
How many shards are needed on one node in Elasticsearch when we have no cluster?
It's recommended to set the number of shards relative to the number of nodes; meaning, if you have 1 node, you need one or two shards. But it also depends on the number of documents: aim for roughly 1 million documents per shard.
In conclusion, use one or two shards, but if you have more than 2 million documents, you need more nodes.
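Whichever number you settle on, the primary shard count has to be set when the index is created. A minimal sketch with the elasticsearch-py client (the index name, shard count, and the 7.x-style body argument are assumptions, not from the original answer):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Explicitly pick 2 primary shards instead of the old default of 5; with a
# single standalone node there is nowhere to place replica copies, so replicas
# are set to 0.
es.indices.create(
    index="site-search",
    body={
        "settings": {
            "number_of_shards": 2,
            "number_of_replicas": 0,
        }
    },
)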