If we get 1000 hits/s, how many nodes are required in Elasticsearch for a single complex query with aggregation?
I have a Scylla cluster with 3 Nodes and 1 Table created with the below Query
CREATE TABLE id_features (
id int PRIMARY KEY,
id_feature_1 int,
id_feature_2 int,
)
I am issuing the query below from the application:
SELECT * FROM id_features where id in (1,2,3,4...120);
The query can have a maximum of 120 ids.
Will this query contact all 3 nodes, based on the token value of the ids, to fetch data for 120 ids in the worst case?
Or will only 1 node be contacted to fetch the data for all the ids, with multiple nodes used only for high availability?
Will the replication factor, consistency level, and load balancing policy play any role in deciding the node?
Will this query contact all 3 nodes based on the token value of the ids to fetch data?
Will the replication factor, consistency level, and load balancing policy play any role in deciding the node?
It very much depends on things like replication factor (RF), query consistency, and load balancing policy. Specifically, if RF < number of nodes, then multiple nodes will be contacted, based on the hashed token value of id and the nodes primarily assigned to those token ranges.
But, given this statement:
Or only 1 node will be contacted to fetch the data for all the ids and multiple nodes are used only for high availability
...I get the sense that RF=3 in this case.
If the app is configured to use the (default) TokenAwarePolicy then yes, for single-key queries only, requests can be sent to the individual nodes.
But in this case, the query is using the IN operator. Based on the 120 potential entries, the query cannot determine a single node to send the query. In that case, the TokenAwarePolicy simply acts as a pass-through for its child policy (DCAwareRoundRobinPolicy), and it will pick a node at LOCAL distance to be the "coordinator." The coordinator node will then take on the additional tasks of routing replica requests and compiling the result set.
As to whether or not non-primary replicas are utilized in query plans, the answer is again "it depends." While the load balancing policies differ in implementation, in general all of them compute query plans which:
are different for each query, in order to balance the load across the cluster;
only contain hosts that are known to be able to process queries, i.e. neither ignored nor down;
favor local hosts over remote ones.
Taken from: https://docs.datastax.com/en/developer/java-driver/3.6/manual/load_balancing/#query-plan
So in a scenario where RF = number of nodes, a single node sometimes may be used to return all requested replicas.
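To make the RF argument above concrete, here is a toy sketch in Python. It is not how Scylla or Cassandra actually place data: the md5-modulo placement, the node names, and both helper functions are assumptions made purely for illustration (real clusters use Murmur3 tokens on a ring, usually with vnodes). It only shows why, with RF equal to the node count, any single node holds every requested id.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def replicas(key, nodes, rf):
    """Return the rf nodes holding `key` in this toy model."""
    # Assumption: md5 modulo node count stands in for the real partitioner
    # and token ring. Real clusters use Murmur3 tokens and vnodes.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def nodes_holding_all(keys, nodes, rf):
    """Return the nodes that hold a replica of every key in `keys`."""
    common = set(nodes)
    for key in keys:
        common &= set(replicas(key, nodes, rf))
    return common


ids = range(1, 121)  # the 120 ids from the question
all_rf3 = nodes_holding_all(ids, NODES, rf=3)
all_rf1 = nodes_holding_all(ids, NODES, rf=1)
```

In this model `all_rf3` contains all three nodes, so whichever node becomes the coordinator could serve every id from local data. With `rf=1` each id lives on exactly one node, so at most one node could hold them all, and in practice the coordinator has to ask its peers.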
Pro-tip:
Try not to use the IN operator with a list of 120 partition key entries. That forces Cassandra to perform random reads, whereas it really excels at sequential reads. If that's a query the app really needs to do, try:
Building a new table to better support that query pattern.
Not exceeding double digits of entries for IN.
I am trying to query a MongoDB collection where more than 10,000 documents match the query. Even though I have used an index, the query takes more than 25 seconds.
For example, I have a collection People with the fields name and age.
I need to fetch the People documents whose age is 25. If the query matches 10,000 documents, it takes a long time to fetch all of the data.
I have created an index like db.people.createIndex({"age":1})
How can I reduce the query time?
Run db.collection.find().explain() and make sure that your index is in fact used. Make sure there are no COLLSCAN stages in the plan: https://docs.mongodb.com/manual/reference/explain-results/.
If your documents have some or many large attributes and you need only a few of them, request only those (e.g. only _id, or _id and name). Less data transferred means higher speed.
If your database does not fit in memory, make it fit in memory. Once the database no longer fits, performance will be much worse.
If you are not running on a sharded cluster, create one based on a reasonable shard key. Age may not be a good one, because then all age=25 documents will end up on one node. Even if you have one computer with multiple CPUs, sharding may still work better for you (if you have enough memory for that). It may also work the other way around: if you have a sharded cluster on one computer and your replicas do not fit in memory, it may be better to use just one node.
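The first check above (looking for COLLSCAN in the explain output) can be scripted. The sketch below is a minimal Python helper that walks an explain document that has already been loaded as a dict. It assumes the classic unsharded shape with queryPlanner.winningPlan and nested inputStage / inputStages entries. The function names and the two sample plans are hypothetical and hand-written for illustration, not captured from a real server, and sharded or newer plan formats may nest the stages differently.

```python
def find_stages(plan):
    """Yield every stage name found in a (possibly nested) plan dict."""
    if not isinstance(plan, dict):
        return
    if "stage" in plan:
        yield plan["stage"]
    if "inputStage" in plan:
        yield from find_stages(plan["inputStage"])
    for child in plan.get("inputStages", []):
        yield from find_stages(child)


def uses_collscan(explain_output):
    """Return True if the winning plan contains a COLLSCAN stage."""
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    return "COLLSCAN" in find_stages(winning)


# Hand-written sample plans (assumed shape, for illustration only).
indexed_plan = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "age_1"},
        }
    }
}
scan_plan = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
```

Here `uses_collscan(indexed_plan)` is False and `uses_collscan(scan_plan)` is True. If the index is already used, the remaining points (projection, memory, sharding) are the ones to look at.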
I'm currently using DB2 and planning to move to Cassandra because, as far as I know, Cassandra has better read performance than an RDBMS.
Maybe this is a stupid question, but I ran an experiment that compares read performance between DB2 and Cassandra.
I tested with 5 million records and the same table schema.
With the query SELECT * FROM customer, DB2 takes 25-30s and Cassandra takes 40-50s.
But with a WHERE condition, SELECT * FROM customer WHERE cusId IN (100,200,300,400,500), DB2 takes 2-3s and Cassandra takes 3-5ms.
Why is Cassandra faster than DB2 with a WHERE condition? And does that mean I can't prove which database is better with SELECT * FROM customer?
FYI.
Cassandra: RF=3 and CL=1 with 3 nodes running on 3 computers (Ubuntu VMs)
DB2: runs on Windows
Table schema:
cusId int PRIMARY KEY, cusName varchar
If you look at the types of problems that Cassandra is good at solving, then the reasons behind why unbound ("Select All") queries suck become quite apparent.
Cassandra was designed to be a distributed database. In many Cassandra storage patterns, the number of nodes is greater than the replication factor (i.e., not all nodes contain all of the data). Therefore, limiting the number of network hops becomes essential to modeling high-performing queries. Cassandra performs very well with specific queries (which utilize the partition/clustering key structure), because it can quickly locate the node primarily responsible for the data.
Unbound queries (a.k.a. multi-key queries) incur extra network time because a coordinator node is required. So one node acts as the coordinator, queries all other nodes, collates data, and returns the result set. Specifying a WHERE clause (with at least a partition key) while using a "Token Aware" load balancing policy performs well for two reasons:
A coordinator node is not required.
The node primarily responsible for the range is queried, returning the result set in a single network hop.
tl;dr;
Querying Cassandra with an unbound query causes it to incur a lot of extra processing and network time that it normally wouldn't have to do, had the query been specified with a WHERE clause.
Even for a troublesome query like a no-condition range query, 40-50s is pretty extreme for C*. Is the coordinator hitting GCs during the coordination? Can you include the code used for your test?
When you run a select * over millions of records, it won't fetch them all at once; it will grab fetchSize rows at a time. If you're just iterating through the results, the iterator will actually block even if you used executeAsync initially. This means that every fetchSize records (5000 by default in the Java driver) it will issue a new query that you will block on. The serialized nature of this takes time just from a network perspective. http://docs.datastax.com/en/developer/java-driver/3.1/manual/async/#async-paging explains how to do it in a non-blocking way. You can use this to kick off the next page fetch while processing the current one, which would help.
Decreasing the limit or fetch size could also help, since the coordinator may walk token ranges one at a time (parallelism is possible here, but its heuristic is not perfect) until it has read enough. If it has to walk too many nodes to respond, it will be slow. This is why a select * on an empty table can be very slow: it may serially walk every replica set. With 256 vnodes this can be very bad.
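The "kick off the next page fetch while processing the current one" idea can be sketched without a cluster. The Python below is only an illustration of the pattern, not driver code: fetch_page is a hypothetical stand-in for one paged round trip (it slices an in-memory list), and the integer page state is an assumption. A real application would use the driver's own async paging API from the link above.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(rows, page_state, fetch_size):
    """Hypothetical stand-in for one paged round trip to the cluster."""
    start = page_state or 0
    page = rows[start:start + fetch_size]
    end = start + fetch_size
    next_state = end if end < len(rows) else None  # None means last page
    return page, next_state


def iterate_with_prefetch(rows, fetch_size):
    """Yield rows page by page, requesting page N+1 before processing page N."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_page, rows, None, fetch_size)
        while future is not None:
            page, next_state = future.result()
            # Start the next fetch before handing the current page to the caller.
            if next_state is not None:
                future = pool.submit(fetch_page, rows, next_state, fetch_size)
            else:
                future = None
            for row in page:
                yield row


result = list(iterate_with_prefetch(list(range(25)), fetch_size=10))
```

The caller sees the same rows in the same order as with a plain blocking iterator. The difference is that the network wait for the next page overlaps with the processing of the current one.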
Mongoose 4.4 now has an insertMany function which lets you validate an array of documents and insert them if valid all with one operation, rather than one for each document:
var arr = [{ name: 'Star Wars' }, { name: 'The Empire Strikes Back' }];
Movies.insertMany(arr, function(error, docs) {});
If I have a very large array, should I batch these? Or is there no limit on the size of the array?
For example, I want to create a new document for every Movie, and I have 10,000 movies.
Based on personal experience, I'd recommend batches of 100-200 documents, which give a good blend of performance without putting strain on your system.
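As a rough illustration of that batching advice, here is a small Python sketch of the splitting logic only. The batches helper and the default size of 200 are assumptions for the example. The actual insert call (for instance Movies.insertMany in Mongoose) would be made once per batch and is not shown here.

```python
def batches(docs, size=200):
    """Yield consecutive slices of `docs` with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


# 10,000 hypothetical movie documents, as in the question.
movies = [{"name": "Movie %d" % i} for i in range(10000)]
movie_batches = list(batches(movies, size=200))
```

With 10,000 documents and a batch size of 200 this produces 50 batches, each of which would be passed to one insert call.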
An insertMany group of operations can have at most 1000 operations. If a group exceeds this limit, MongoDB will divide the group into smaller groups of 1000 or less. For example, if the queue consists of 2000 operations, MongoDB creates 2 groups, each with 1000 operations.
The sizes and grouping mechanics are internal performance details and are subject to change in future versions.
Executing an ordered list of operations on a sharded collection will generally be slower than executing an unordered list since with an ordered list, each operation must wait for the previous operation to finish.
Mongo 3.6 update:
The limit for insertMany() has increased in Mongo 3.6 from 1000 to 100,000.
Meaning that now, groups above 100,000 operations will be divided into smaller groups accordingly.
For example: a queue that has 200,000 operations will be split and Mongo will create 2 groups of 100,000 each.
The method takes that array and inserts its elements through MongoDB's insertMany method, so the practical size of the array depends on how much your machine can handle.
But please note another point, which is not a limitation but is worth taking into consideration, about how MongoDB deals with multiple operations: by default it handles a batch of 1000 operations at a time and splits anything larger than that.
I'm using ArangoDB for a Web Application through Strongloop.
I have a performance problem when I run this query:
FOR result IN Collection SORT result.field ASC RETURN result
I added an index to speed up the query, namely a skiplist index on the sorted field.
My collection contains more than 1M records.
The application is hosted on an n1-highmem-2 instance on Google Cloud.
Below some specs:
2 CPUs - Xeon E5 2.3Ghz
13 GB of RAM
10GB SSD
Unfortunately, my query takes a long time to finish.
What can I do?
Best regards,
Carmelo
Summarizing the discussion above:
If there is a skiplist index present on the field attribute, it could be used for the sort. However, if it was created sparse, it can't. This can be checked by running
db.Collection.getIndexes();
in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sorting will be required, which can be verified using Explain.
However, the query will still build a huge result in memory which will take time and consume RAM.
If a large result set is desired, LIMIT can be used to retrieve slices of the results in several chunks, which will cause less stress on the machine.
For example, first iteration:
FOR result IN Collection SORT result.field LIMIT 10000 RETURN result
Then process these first 10,000 documents offline, and note the result.field value of the last processed document.
Now run the query again, this time with an additional FILTER (passing that value as the bind parameter @lastValue):
FOR result IN Collection
FILTER result.field > @lastValue SORT result.field LIMIT 10000 RETURN result
Repeat this until there are no more documents. That should work fine if result.field is unique.
If result.field is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method is only an approximation: documents that share the last processed value but fell outside the chunk are skipped by the > comparison.
Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already.
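The chunked retrieval loop described above can be simulated in a few lines of Python. This is a sketch of the control flow only: fetch_chunk is a hypothetical stand-in that mimics the FILTER / SORT / LIMIT query on an in-memory list instead of calling ArangoDB, and it assumes that the field values are unique.

```python
def fetch_chunk(docs, last_value, limit):
    """Stand-in for: FILTER field > last_value SORT field LIMIT limit."""
    matching = [d for d in docs if last_value is None or d["field"] > last_value]
    return sorted(matching, key=lambda d: d["field"])[:limit]


def iterate_chunks(docs, limit):
    """Yield sorted chunks, resuming each time after the last value seen."""
    last_value = None  # None means "first iteration, no FILTER yet"
    while True:
        chunk = fetch_chunk(docs, last_value, limit)
        if not chunk:
            break
        yield chunk
        last_value = chunk[-1]["field"]


docs = [{"field": v} for v in [5, 3, 9, 1, 7]]
chunks = [[d["field"] for d in chunk] for chunk in iterate_chunks(docs, limit=2)]
# chunks == [[1, 3], [5, 7], [9]]
```

Each iteration only needs the last value from the previous chunk, so the client never has to hold more than one chunk in memory at a time.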