Hello ArangoDB community,
I could not find any documentation about shard keys for ArangoSearch views. Without a shard key, every search request would fan out to all nodes in the cluster, and search performance would suffer badly. Imagine 300 servers getting hit every time someone searches.
I've read the sharding documentation for collections, but could not find anything equivalent for ArangoSearch views:
https://www.arangodb.com/docs/stable/architecture-deployment-modes-cluster-sharding.html
Related
I know that secondary indexes in Cassandra are generally a bad idea because the index is stored locally on each node, i.e. not distributed across the cluster, which may result in a query scanning a huge number of nodes. However, I don't understand why they are still a bad idea if I always specify the partition key in my queries and only use the secondary index as a final filter. I've read that they don't scale with large amounts of data even if I specify the partition key. Is this true? And if so, why?
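To illustrate the pattern I mean, here is a rough sketch with a made-up schema (table, index, and column names are all hypothetical):

```python
# Hypothetical schema, created once:
#   CREATE TABLE app.events (
#       account_id text, event_time timeuuid, status text, payload text,
#       PRIMARY KEY ((account_id), event_time));
#   CREATE INDEX events_status_idx ON app.events (status);
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("app")

# The partition key (account_id) restricts the query to one node
# (plus replicas); the secondary index on status is only a final filter.
rows = session.execute(
    "SELECT * FROM events WHERE account_id = %s AND status = %s",
    ("acct-123", "failed"),
)
for row in rows:
    print(row.event_time, row.status)
```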
In general, secondary indexes are a bad idea, not only because of the distributed aspect but also because of index size and the number of distinct values: if you index a field with very high or very low cardinality, you will spend time scanning many rows or many columns.
You can also run into other issues when dealing with tombstones.
To answer your question: secondary indexes in Cassandra don't scale that well, but if you also specify the partition key, thereby telling Cassandra which node has the data, they perform much better!
You can find more details in section F here:
https://www.datastax.com/blog/2016/04/cassandra-native-secondary-index-deep-dive
I hope this helps!
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary-key query, where it is known exactly which node needs to be queried). So there's an impact not just on writes, but on read performance as well.
Imagine Cassandra on a ring of five machines, with a primary index of user IDs and a secondary index of user emails. If you were to query for a user by their ID (their primary indexed key), any machine in the ring would know which machine holds that user's record: one query, one read from disk. However, to query for a user by their email (their secondary indexed value), each machine has to query its own records of users: one query, five reads from disk. By scaling either the number of users system-wide or the number of machines in the ring, the noise-to-signal ratio increases and the overall efficiency of reading drops, in some cases to the point of timing out.
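To make that concrete, the two access patterns from the example look roughly like this (hypothetical users table with a secondary index on email):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")

# Primary-key lookup: the coordinator knows exactly which node owns
# this partition -- one query, one node reads from disk.
by_id = session.execute(
    "SELECT * FROM users WHERE user_id = %s", ("user-42",)
).one()

# Secondary-index lookup: the index on email is local to each node,
# so all five machines in the ring consult their own index --
# one query, five reads from disk.
by_email = session.execute(
    "SELECT * FROM users WHERE email = %s", ("user42@example.com",)
).one()
```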
Please refer to the link below for a good explanation of secondary indexes:
https://dzone.com/articles/cassandra-scale-problem
I was reading the Cosmos DB docs on best practices for query performance, and I found the following ambiguous:
With Azure Cosmos DB, typically queries perform in the following order, from fastest/most efficient to slower/less efficient:
GET on a single partition key and item key
Query with a filter clause on a single partition key
Query without an equality or range filter clause on any property
Query without filters
Is there a difference in performance or RUs between a "GET on a single partition key and item key" and a "QUERY on a single partition key and item key"? It's not entirely clear to me whether this falls into case #1 or #2, or is somewhere in between.
Basically, I'm asking whether we ever need to use GET at all. The docs don't seem to clarify this anywhere.
A direct GET will be faster. As documented, a 1K document should cost 1 RU to retrieve. You will have a higher RU cost for a query, as you're engaging the query engine.
One caveat: with a direct read (the GET), you will retrieve the entire document. With a query, you can choose the projection of properties. For very large documents, this could result in significant bandwidth savings for your app, when using a query.
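For example, with the Python SDK the two access patterns look like this (endpoint, key, and container names are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = client.get_database_client("mydb").get_container_client("users")

# Point read (the GET): ~1 RU for a 1 KB document, returns the
# whole document.
doc = container.read_item(item="user-42", partition_key="user-42")

# Query on the same partition key and id: engages the query engine
# (somewhat more RUs), but can project just the properties you need.
emails = list(container.query_items(
    query="SELECT c.email FROM c WHERE c.id = @id",
    parameters=[{"name": "@id", "value": "user-42"}],
    partition_key="user-42",
))
```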
I need to implement a functionality to search users by their nickname.
I know that it's possible to create a SASI index on the nickname, and the search will work. However, as far as I understand, the query will be sent to all nodes in the cluster.
I want to modify the table and introduce a shard key that will be the first letter of the nickname. That way, when a user starts to search, we know that we only need to forward the query to a specific node (plus its replicas), along the lines of the sketch below.
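Roughly like this (a sketch only; the table and column names are made up):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("social")

# Partition by first letter, cluster by nickname, so a prefix search
# only touches one partition (one node plus its replicas).
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_nickname (
        first_letter text,
        nickname text,
        user_id uuid,
        PRIMARY KEY ((first_letter), nickname)
    )
""")

prefix = "ali"
rows = session.execute(
    "SELECT user_id, nickname FROM users_by_nickname "
    "WHERE first_letter = %s AND nickname >= %s AND nickname < %s",
    (prefix[0], prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)),
)
```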
P.S. I know this kind of pattern can create a hotspot. However, I think the trade-offs here are worthwhile, and in practice I should not run into issues from the hotspot (I don't expect to get a billion users in my system).
What do you think?
Thank you in advance.
Can I inject a sharding algorithm into either Cassandra or Couchbase?
Or do they decide where each document goes?
For instance, what if I want to pin data to shards based on one of the data's properties?
Couchbase hashes the key of the document to decide which shard (vBucket) the document should be associated with. The SDK also uses the same algorithm to find the shard where a document is located when you want to retrieve it by its key.
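Conceptually, the key-to-vBucket mapping looks something like this simplified sketch (Couchbase really uses a CRC32-based hash over 1024 vBuckets, but the exact bit manipulation may differ):

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase's default number of vBuckets

def vbucket_for_key(key: str) -> int:
    # Hash the document key and map it to a vBucket; every client
    # computes the same mapping, so any SDK can locate the node that
    # owns the vBucket without asking the server.
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

print(vbucket_for_key("user::42"))
```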
One of the problems of letting developers decide on the sharding algorithm is that sometimes they end up with an excessive number of documents in a single shard, and naturally, this shard becomes the bottleneck of the application.
One of the core concepts in Couchbase is that the documents are (almost) evenly distributed between all shards, so I am not familiar with any native support to insert your own algorithm there.
Cassandra decides where the data goes based on the partition key. So if you use the property you want to "pin" by as the partition key, it will accomplish what you're asking for, I think. However, you don't pick the replicas explicitly, and placement can change as hosts are added to or removed from the cluster.
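For instance, if the property you want to pin by were called `region` (a hypothetical schema), making it the partition key keeps all rows with the same value on the same replica set:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")

# All documents with the same region value hash to the same token,
# and therefore land on the same set of replicas.
session.execute("""
    CREATE TABLE IF NOT EXISTS docs_by_region (
        region text,
        doc_id uuid,
        body text,
        PRIMARY KEY ((region), doc_id)
    )
""")
```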
I need to transfer large amounts of data from CouchDB. My query returns all available keys, and I then request the documents for those keys. There is an option to obtain all documents at once, but the transmitted keys alone already take up more than 1 GB. MongoDB has a cursor for tasks like this, but it uses a different protocol.
How can I get all the documents contained in CouchDB at once, instead of fetching them one by one?
I tried fetching the keys in batches, but I consider that option a last resort.
The CouchDB docs explain how to paginate results.
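For instance, the limit-plus-one pattern from the docs can be driven with plain HTTP (database URL and page size are placeholders):

```python
import json
import requests

BASE = "http://127.0.0.1:5984/mydb"  # placeholder database URL
PAGE = 1000

def process(doc):
    pass  # replace with your own handling

startkey = None
while True:
    # Request one extra row; its key becomes the start of the next page.
    params = {"include_docs": "true", "limit": PAGE + 1}
    if startkey is not None:
        params["startkey"] = json.dumps(startkey)
    rows = requests.get(f"{BASE}/_all_docs", params=params).json()["rows"]

    for row in rows[:PAGE]:
        process(row["doc"])

    if len(rows) <= PAGE:
        break
    startkey = rows[PAGE]["key"]
```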