I am trying to use the datastax-driver paging with fetch-size. However, the DataStax documentation says the following:
Note that setting a fetch size doesn’t mean that Cassandra will always return the exact number of rows, it is possible that it returns slightly more or less results
I don't really know the internal details of the paging implementation, but can someone please clarify in which situations we get more or fewer results from the server? For example, if I set the fetch size to 10, based on the above statement it's possible to get 8 or 12 rows as a result. But I am trying to understand in what situation we would receive 8 (or 12) rows.
Note that setting a fetch size doesn’t mean that Cassandra will always return the exact number of rows, it is possible that it returns slightly more or less results
I'm not confident this statement is entirely true. You can expect that it's possible for a page to contain fewer rows than the desired page size. For example, if your page size is 10, and there are only 8 rows that match your query criteria, of course you will only get 8 rows back.
However, I'm not familiar with a case where the server will send back more rows than the page size in a single page result. The native protocol specification even specifies that the message returned will contain at most the page size:
If a positive value is provided for result_page_size, the result set of the RESULT message returned for the query will contain at most the result_page_size first rows of the query result.
Further, the protocol spec also states:
While the current implementation always respects the exact value of result_page_size, we reserve the right to return slightly smaller or bigger pages in the future for performance reasons.
I don't think that has been exercised, but it might explain why the driver docs are phrased this way.
Andy's answer is fairly complete, but I want to add a few more insights on why returning pages that are not exactly the desired size may be useful - in current or future implementations:
One reason why Cassandra may want to return short pages is filtering. Imagine that the request has ALLOW FILTERING, and needs to read a lot of data from disk just to produce a few rows that end up passing the filter and being returned to the client. The client, not aware of this, has asked for a page of 1000 rows - but in our example actually generating 1000 rows that pass the filter might take 10 seconds, and the client would time out if Cassandra waited 10 seconds before producing any results. So in this case, Cassandra should just return whatever rows it managed to collect before timing out - even if these are just 17 rows and not 1000 rows. The client would receive these 17 rows and continue to the next page normally.
In the extreme case, there may be so much filtering work and so little output that a long time can pass without even a single row of output. In this case, before timing out, Cassandra may return a page with zero results that has the has_more bit on, meaning the client should continue paging (receiving fewer results than requested - or even zero - is not the sign of when to stop paging!). I'm not sure that Cassandra actually returns zero-row pages today, but Scylla (a faster Cassandra clone) definitely does, and drivers should remember to use the has_more bit as the only sign of when to stop paging.
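To illustrate, here is a minimal paging loop with the Python cassandra-driver; the same idea applies to the Java and Go drivers. It is only a sketch - the cluster address and the table ks.events are assumptions, not something from the question - but it shows the point: the loop keeps paging as long as the driver reports more pages, regardless of how many rows each individual page happened to carry.

# Sketch: paging driven purely by "are there more pages?", never by row counts.
# Assumes a reachable cluster and a table ks.events(pk int, ck int, value text, PRIMARY KEY (pk, ck)).
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect()

query = SimpleStatement(
    "SELECT ck, value FROM ks.events WHERE pk = %s",
    fetch_size=10,  # desired page size; pages may come back smaller (or even empty)
)

result = session.execute(query, (42,))
while True:
    for row in result.current_rows:      # only the rows of the page fetched so far
        print(row.ck, row.value)
    if not result.has_more_pages:
        break                            # the "no more pages" flag ends the loop, not a row count
    result = session.execute(query, (42,), paging_state=result.paging_state)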
The other question is why paging would return more rows than desired. As Andy said in his reply, I don't think this actually happens in Cassandra, nor in Scylla. But I can understand why some future implementation may want to allow it to happen: imagine that a coordinator needs 1000 rows for a page. So it reads up to 1000 rows from each replica, but there's inconsistent data, and one replica has an extra row, and the result is that the coordinator now has 1001 rows to return. It can (and today, does) return only the first 1000 rows, but the downside is that now some of the replicas are in the wrong place in the data and will need to find their place again when asked to read the next page. Had we returned all 1001 rows we found, all of the replicas would be able to resume their reads efficiently from exactly where they left off.
I have a use case in which I use ScyllaDB to limit users' actions in the past 24h. Let's say the user is only allowed to make an order 3 times in the last 24h. I am using ScyllaDB's TTL and counting the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know what the most efficient way to query the table is. So I have a few queries which I'd like to understand better and compare the behavior of (please correct me if any of my statements are wrong):
using count()
count() will perform a full-scan query, meaning that it may read more records from the table than necessary.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit will only limit the number of records returned to the client, meaning it will still read all records that match its predicates but only return up to the limit.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly it should only read up until it has received the first N records, without having to scan the whole table. So if I set the page size to the number of records I want to fetch and only query the first page, would it work correctly? And will it have a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
My query still uses limit, but utilizes the driver's paging to achieve this with https://github.com/gocql/gocql:
iter := conn.Query(
    "SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
    hashID,
    userID,
    3,
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis is correct and which method would be best to choose.
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems that your example is a bit minimal for what you actually want to achieve and - even if it is not - we have to consider such a set-up at scale. E.g.: there may be one tenant allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week.
If the above assumption is correct - and with the options you have given - you are better off using LIMIT with paging. The reason is that there are some particular problems with the description you've given:
First, you want to retrieve N records within a particular time frame, but your queries don't specify such a time frame.
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination can be used to determine the number of records within a time frame.
Of course, it may be that I am wrong, but I'd like to suggest some different approaches which may or may not be applicable to you and your use case.
Consider making a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
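Just to make the idea concrete, here is a hedged sketch with the Python driver; the keyspace, table, and column names are assumptions and not taken from the question:

# Sketch of the "timestamp in the clustering key" idea; all names are hypothetical.
import uuid
from datetime import datetime, timedelta
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ks")
session.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        hash_id  bigint,
        user_id  uuid,
        ts       timestamp,
        order_id uuid,
        PRIMARY KEY ((hash_id, user_id), ts, order_id)
    )
""")

# Only the slice of the partition covering the last 24h is read, not the whole partition.
hash_id, user_id = 1234, uuid.uuid4()            # placeholders
since = datetime.utcnow() - timedelta(hours=24)
rows = session.execute(
    "SELECT order_id FROM orders WHERE hash_id=%s AND user_id=%s AND ts >= %s",
    (hash_id, user_id, since))
recent_orders = sum(1 for _ in rows)             # at most a handful of rows with this access pattern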
If the above is not applicable, then perhaps a counter table would suit your needs. You could simply increment a counter after an order is placed, and afterwards query the counter table, as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
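Again only a sketch of the counter-table idea (table and column names are assumptions, and the 24h window is simplified to a per-day bucket): increment on every order, then read back a single small row to check the limit.

# Sketch: one counter row per user per day; all names are hypothetical.
import uuid
from datetime import date
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ks")
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_per_day (
        hash_id bigint,
        user_id uuid,
        day     date,
        total   counter,
        PRIMARY KEY ((hash_id, user_id), day)
    )
""")

hash_id, user_id, today = 1234, uuid.uuid4(), date.today()   # placeholders

# After an order is placed, bump the counter:
session.execute(
    "UPDATE orders_per_day SET total = total + 1 WHERE hash_id=%s AND user_id=%s AND day=%s",
    (hash_id, user_id, today))

# Before accepting a new order, read back one small row:
row = session.execute(
    "SELECT total FROM orders_per_day WHERE hash_id=%s AND user_id=%s AND day=%s",
    (hash_id, user_id, today)).one()
allowed = row is None or row.total < 3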
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers; the partition key doesn't need to look random. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit when faced with concurrent operations: imagine that the given partition already has two records, and now two concurrent record-addition requests come in. Both can check that there are just two records, decide it's fine to add the third - and when both add their record, you end up with four records. You'll need to decide whether it's fine for you that a user can get in 4 requests in a day if they are lucky, or whether it's a disaster. Note that theoretically you can get even more than 4 - if the user manages to send N requests at exactly the same time, they may be able to get 2+N records into the database (but in the usual case, they won't manage to get many superfluous records). If you want 3 to be a hard limit, you'll probably need to change your solution - perhaps to one based on LWT rather than TTL.
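If you do want a hard limit, here is a hedged sketch of that LWT direction. It keeps a plain int rather than a counter (counter columns cannot be combined with IF conditions), and it simplifies the rolling 24h window to a per-day bucket; the keyspace, table, and column names are all assumptions.

# Hedged sketch of a hard limit via lightweight transactions (LWT); all names are hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ks")
session.execute("""
    CREATE TABLE IF NOT EXISTS order_quota (
        hash_id bigint,
        user_id uuid,
        day     date,
        used    int,      -- plain int, because counters cannot be used with IF conditions
        PRIMARY KEY ((hash_id, user_id), day)
    )
""")

def try_take_slot(hash_id, user_id, day, limit=3):
    """Claim one order slot atomically; returns True if the order may proceed."""
    row = session.execute(
        "SELECT used FROM order_quota WHERE hash_id=%s AND user_id=%s AND day=%s",
        (hash_id, user_id, day)).one()
    if row is None:
        # First order in this bucket: only one of the concurrent requests wins the insert.
        return session.execute(
            "INSERT INTO order_quota (hash_id, user_id, day, used) VALUES (%s, %s, %s, 1) IF NOT EXISTS",
            (hash_id, user_id, day)).was_applied
    if row.used >= limit:
        return False
    # Conditional update fails if a concurrent request already bumped the count; caller may retry.
    return session.execute(
        "UPDATE order_quota SET used=%s WHERE hash_id=%s AND user_id=%s AND day=%s IF used=%s",
        (row.used + 1, hash_id, user_id, day, row.used)).was_applied

Note that the conditional statements go through Paxos, so they cost more than plain writes - that is the price of turning the soft limit into a hard one.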
Third, I want to note that there is no important performance difference between COUNT and LIMIT when you know a priori that there will only be up to 3 (or perhaps, as explained above, 4 or some other similarly small number) results. If you assume that the SELECT only yields three or fewer results, and it can never be a thousand results, then it doesn't really matter whether you retrieve them or count them - just do whichever is convenient for you. In any case, I think that paging is not a good solution for your need. For such short results you can just use the default page size - you'll never reach it anyway - and paging also hints to the server that you will likely continue reading on the next page, so it caches the buffers it needs to do that, while in this case you know that you'll never continue after the first three results. So in short, don't use any special paging setup here - just use the default page size (which is 1MB) and it will never be reached anyway.
I would like to check how many entries in a DynamoDB table match a query, without retrieving the actual entries, using boto3.
I want to run a machine learning job on data from a DynamoDB table. The data I'm training on is the data that answers a query, not the entire table. I want to run the job only if I have enough data to train on.
Therefore, I want to check that I have enough entries that match the query.
It is worth mentioning that the DynamoDB table I'm querying is really big, therefore actually retrieving the entries is not an option unless I actually want to run the job.
I know that I can use boto3.dynamodb.describe_table() to get how many entries there are in the entire table, but as I mentioned earlier, I want to know only how many entries match a query.
Any ideas?
This was asked and answered in the past, see How to get item count from DynamoDB?
Basically, you need to use the "Select" parameter to tell DynamoDB to only count the query's results, instead of retrieving them.
As usual in DynamoDB, this is subject to paging: if the result set (not the count - the actual full results) is larger than 1 MB, then only the first 1 MB is read, the items in it are counted, and you get back this partial count. If you're only interested in checking whether you have "enough" results, this may even be better for you - because you don't want to pay for reading a gigabyte of data just to check whether the data is there. You can even ask for a smaller page, to read less - depending on what you consider enough data.
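A minimal boto3 sketch of that pattern (the table name and key name here are made up for illustration): ask DynamoDB for a COUNT, and stop paging as soon as the running count reaches your threshold or LastEvaluatedKey disappears.

# Sketch: count query matches without retrieving items; table/key names are hypothetical.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("training_data")

def has_enough_items(partition_value, minimum):
    count = 0
    kwargs = {
        "KeyConditionExpression": Key("dataset_id").eq(partition_value),
        "Select": "COUNT",                    # return only the count, not the items
    }
    while True:
        resp = table.query(**kwargs)
        count += resp["Count"]
        if count >= minimum:
            return True                       # enough matches; no need to keep paging (or paying)
        if "LastEvaluatedKey" not in resp:
            return False                      # scanned everything that matches; still not enough
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

You can also pass a Limit to cap how many items are examined per request, which bounds the read cost of each round trip.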
Just remember that you'll pay Amazon not by the amount of data returned (just one integer, the count) but by the amount of data read from disk. Using such counts excessively may lead to surprisingly large costs.
I am reading about driver pagination here, but CQL also supports LIMIT in a query. I wonder what the difference is between these two.
Pagination is how much of your result you work with at a time.
WHERE and LIMIT are about what is in your result.
Imagine you request all rows where X < 100. This may match 1 million different rows. If the client or the server requested all of this at once, it would cause a lot of resource pressure. To avoid this, the driver is capable of asking for just a few rows at a time. This allows the client and the server to work with a stream of items rather than allocating space for everything up front.
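A small sketch with the Python driver to show the difference (the cluster address, keyspace, table, and column names are assumptions): LIMIT caps the total result, while fetch_size only controls how many rows travel per round trip.

# Sketch: LIMIT bounds the result set; fetch_size only shapes how it is streamed.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("ks")

# At most 100 rows, ever - even if a million rows match x < 100.
capped = session.execute("SELECT * FROM data WHERE x < 100 LIMIT 100 ALLOW FILTERING")

# All matching rows, but fetched 500 at a time behind the scenes while you iterate.
streamed = SimpleStatement("SELECT * FROM data WHERE x < 100 ALLOW FILTERING", fetch_size=500)
for row in session.execute(streamed):
    print(row)  # the driver transparently requests the next page as needed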
I am trying to request a large number of documents from my database (which has over 400k documents). I started using the _all_docs built-in view. I first tried this query:
http://database:port/databasename/_all_docs?limit=100&include_docs=true
No problem. Completes as expected. Now to ramp it up:
http://database:port/databasename/_all_docs?limit=1000&include_docs=true
Still fine. Took longer, more data, etc. as expected. Ramp it up again:
http://database:port/databasename/_all_docs?limit=10000&include_docs=true
Request never completes. The dev tools in Chrome show Size = 5.3MB (which seems to be significant), and this occurs no matter what value I use for the limit parameter above roughly 6500. Whether I specify 6500 or 10,000, it always shows 5.3MB downloaded, and the request stalls.
I have also tried other combinations, such as "skip", and it seems that limit + skip must be < 6500 or I get the same stall.
My environment: Couchdb 1.6.1, Ubuntu 14.04.3 LTS, Azure A1 standard
You have to pre-warm your queries. Just asking for 100K or more docs and expecting to get them out of CouchDB in one go won't work - it just won't work.
When you ask for some items from a view (in your case the default view), on the first read CouchDB will notice that the B-tree for the view doesn't exist yet, so it goes ahead and builds it on that first read. Depending on how many documents you have in your database, that can take a while, putting a good workload on your database.
On every subsequent read, CouchDB will check whether documents have changed since the view was last updated, and run only the changed documents through the map and reduce functions. So if you only query some views from time to time, but have lots of changes in between, expect some delays on the next read.
There are two ways to handle this situation:
1. Pre-warm your view - run a cron job that does reads to make sure that the B-tree for this view is kept up to date.
2. Define your view in advance for a particular query, before inserting the data into CouchDB.
And for now, if you really want to read all your docs, don't read them all at once; rather, use skip/limit range queries.
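A hedged sketch of reading the documents in chunks with Python and requests (the host and database names are placeholders): it pages with limit plus startkey rather than an ever-growing skip, which keeps each request cheap even deep into the database, since CouchDB would otherwise have to walk over and discard all the skipped rows.

# Sketch: stream _all_docs in pages instead of one huge request; URL and names are placeholders.
import json
import requests

BASE = "http://127.0.0.1:5984/databasename"
PAGE = 1000

params = {"include_docs": "true", "limit": PAGE + 1}    # one extra row tells us where the next page starts
while True:
    rows = requests.get(BASE + "/_all_docs", params=params).json()["rows"]
    for row in rows[:PAGE]:
        doc = row["doc"]
        # ... process doc here ...
    if len(rows) <= PAGE:
        break                                           # no extra row returned: we reached the end
    params["startkey"] = json.dumps(rows[PAGE]["id"])   # resume from the extra row on the next request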
I understand that CouchDB hashes the source of each design document to derive the name of the index file. Whenever I change the source code, the index needs to be rebuilt. CouchDB does this when the document is requested for the first time.
What I'd expect to happen and want to happen
Each time I change a design doc, the first call to a view will take significantly longer than usual and may time out. The index will continue to build. Once this is completed, the view will only process changes and will be very fast.
What actually happens
1. When running an amended view for the first time, I see the process in the status window slowly reach 100%. This takes about 2 hours. During this time all CPUs are fully utilized.
2. Once the process reaches 99% it remains there for about an hour and then disappears. CPU utilization drops to just one CPU.
3. When the process has disappeared, the data file for the view keeps growing for about half an hour to an hour. CPU utilization is near 0%.
4. The index file suddenly stops increasing in size.
If I request the view again once I've reached stage 4, the characteristics of stage 3 start again. I have to repeat this process between 5 and 50 times until I can finally retrieve the view values.
If the view gets requested a second time while still in stage 1 or 2, it will most definitely run out of memory and I have to restart the CouchDB service. This is despite my DB rarely using more than 2 GByte when running just one job, and more than 4 GByte being free in usual operation.
I have tried to tweak configuration settings, add more memory, but nothing seems to have an impact.
My Question
Do I misunderstand the concept of running views or is something wrong with my setup?
If this is expected, is there anything I can tweak to reduce the number of reruns?
Context
My documents are pretty large (1 to 20 MByte). The data they contain is well structured; they are usually web-analytics reports that would, in a relational database, be stored as several tens of thousands of rows of data.
My map function extracts these rows. It returns the dimensions as a key array. The key array sometimes exceeds 20 columns; most views have fewer than 10 columns.
The reduce function will aggregate (sum) all values in rows with identical keys. The metrics are stored in a dictionary and may contain different keys. The reduce function identifies missing keys in one document and adds these to the aggregate as 0.
I am using CouchDB 1.5.0 on Windows Server 2008 R2 with 2 CPUs and 8 GByte of memory.
The views are written in JavaScript using the couchjs query server.
My design documents usually consist of several views, with a '_lib' view that does not emit any data but contains an exhaustive library of functions accessed by the actual views.
It is a known issue, but just in case: if you have gigabytes of docs, you can forget about custom reduce functions. Only the built-in ones (_sum, _count, _stats) will work fast enough.
It is possible to set os_process_limit to an extra-low value (1 second, for example). This way you can detect which docs take too long to be indexed and optimize your map function for performance.
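For the built-in reduce point, a hedged sketch of what such a design document might look like, uploaded here with Python and requests purely for illustration - the database URL, view name, and document fields (doc.rows, r.dimensions, r.value) are all assumptions, and updating an existing design doc would additionally require its current _rev:

# Sketch: a view whose reduce is the built-in _sum instead of a custom JavaScript reduce.
import requests

DB = "http://127.0.0.1:5984/databasename"

design_doc = {
    "_id": "_design/reports",
    "language": "javascript",
    "views": {
        "totals_by_dimension": {
            # map emits one numeric value per report row, keyed by its dimensions
            "map": "function(doc) { (doc.rows || []).forEach(function(r) {"
                   " emit(r.dimensions, r.value); }); }",
            # built-in reduce, evaluated inside CouchDB itself - much faster than custom JS
            "reduce": "_sum",
        }
    },
}

resp = requests.put(DB + "/" + design_doc["_id"], json=design_doc)
resp.raise_for_status()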