Pagination - what it is and how to do it

I want pagination like the example below. Can you help me?
First | Previous | Page 2 of 5 | Next | Last

There are two ways to do that: in memory or not.
In-memory pagination is when your list is big enough to clutter the UI, but not so big that it strains your server's memory allocation (data x users > available memory). This is easier to implement but does not solve all the problems.
Out-of-memory pagination is when your list is too big to fit in memory, or big enough that it compromises your application's speed, etc. In this case, you usually query the data from a backend database using two parameters: the index of the first record (the offset) and the page size (the limit). This way you retrieve just a small amount of data, leaving the heavyweight sorting/filtering work to the database (which is very good at this kind of thing).
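To make the in-memory option concrete, here is a minimal sketch in Python (the names are illustrative, not from any particular framework): the whole list is already loaded, and a page is just a slice computed from the page number and page size. The database-backed option is sketched after the next answer.

def paginate_in_memory(items, page, page_size):
    # Index of the first record on the requested page.
    start = (page - 1) * page_size
    # An empty slice means the page is out of range.
    return items[start:start + page_size]

# Example: page 2 of a 23-item list, 5 items per page -> items 5..9
print(paginate_in_memory(list(range(23)), page=2, page_size=5))  # [5, 6, 7, 8, 9]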

Pagination is used when you have a large data set to return and you don't want to display all of it at once, but rather chunk it into pages.
For example, if you have a SQL query that returns 10,000 rows but you only want to show 100 of them, you would paginate them. (Those links you have above would be a reference to the page number of the result set you want to return.)
Then, for your SQL, you can adjust the LIMIT and OFFSET in the query based on the page number. Hope this helps.
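As a rough sketch of that idea in Python (assuming SQLite through the standard sqlite3 module and a made-up items table), the page number from the pager links is turned into an OFFSET, and the total row count gives the "Page X of Y" label:

import math
import sqlite3

def fetch_page(conn, page, page_size=100):
    # The total row count drives the "Page X of Y" label and the Last link.
    total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    total_pages = max(1, math.ceil(total / page_size))
    page = min(max(page, 1), total_pages)   # clamp to First/Last
    offset = (page - 1) * page_size         # page number -> index of the first row
    rows = conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
    return rows, f"Page {page} of {total_pages}"

# Usage against a throwaway in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)", [(f"item{i}",) for i in range(500)])
rows, label = fetch_page(conn, page=2)
print(label)  # Page 2 of 5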

Related

Best way to Fetch N rows in ScyllaDB

I have a use case in which I use ScyllaDB to limit users' actions in the past 24h. Let's say a user is only allowed to make an order 3 times in the last 24h. I am using ScyllaDB's TTL and counting the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to compute the hash for the partition key.
However, I would like to know the most efficient way to query the table. I have a few queries whose behavior I'd like to understand better and compare (please correct me if any of my statements are wrong):
using count()
count() performs a full-scan query, meaning it may read more records from the table than necessary.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
LIMIT only restricts the number of records returned to the client, meaning it will still read all records that match its predicates and merely cap how many are returned.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly, paging should only query until it has received the first N records, without having to scan the whole table. So if I set the page size to the number of records I want to fetch and only query the first page, would it work correctly? And would it give a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
My query still uses LIMIT, but relies on the driver (https://github.com/gocql/gocql) to do the paging:
iter := conn.Query(
    "SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
    hashID, userID, 3,
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis is correct and which method would be best to choose.
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems that your example is a bit minimalist for what you actually want to achieve and - even if it is not - we have to consider such a set-up at scale. E.g.: there may be one tenant allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week.
If the above assumption is correct - and given the options you have presented - you are better off using LIMIT with paging. The reason is that there are some particular problems with the description you've given:
First, you want to retrieve N records within a particular time frame, but your queries don't specify such a time frame.
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination can be used to determine the number of records within a time frame.
Of course, I may be wrong, but I'd like to suggest some different approaches which may or may not be applicable to you and your use case.
Consider making a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
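To illustrate the point (this is only a sketch - the schema and column names are assumptions, not taken from the question): with a timestamp as a clustering column, the 24-hour check becomes a slice of a single partition. Shown here with the Python cassandra-driver, which also works against ScyllaDB:

from datetime import datetime, timedelta
from cassandra.cluster import Cluster

# Hypothetical schema:
#   CREATE TABLE orders (
#       hash_id  bigint,
#       user_id  text,
#       ts       timestamp,
#       order_id uuid,
#       PRIMARY KEY ((hash_id, user_id), ts, order_id)
#   ) WITH CLUSTERING ORDER BY (ts DESC);

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

def orders_in_last_24h(hash_id, user_id, limit=3):
    since = datetime.utcnow() - timedelta(hours=24)
    rows = session.execute(
        "SELECT order_id FROM orders "
        "WHERE hash_id=%s AND user_id=%s AND ts >= %s LIMIT %s",
        (hash_id, user_id, since, limit),
    )
    return len(list(rows))

# Allow the order only if fewer than 3 were placed in the last 24 hours.
allowed = orders_in_last_24h(42, "alice") < 3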
If the above is not applicable, then perhaps a counter table would suit your needs. You could simply increment a counter after an order is placed, and afterwards query the counter table, as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
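And a minimal sketch of the counter-table idea (again with assumed names; the date column is one possible way to scope the count to a day, and session is a cassandra-driver Session as in the previous sketch):

# Hypothetical counter table:
#   CREATE TABLE orders_per_day (
#       hash_id bigint,
#       user_id text,
#       date    date,
#       count   counter,
#       PRIMARY KEY ((hash_id, user_id), date)
#   );

def record_order(session, hash_id, user_id, day):
    # Counters can only be incremented/decremented, never set directly.
    session.execute(
        "UPDATE orders_per_day SET count = count + 1 "
        "WHERE hash_id=%s AND user_id=%s AND date=%s",
        (hash_id, user_id, day),
    )

def orders_today(session, hash_id, user_id, day):
    row = session.execute(
        "SELECT count FROM orders_per_day "
        "WHERE hash_id=%s AND user_id=%s AND date=%s",
        (hash_id, user_id, day),
    ).one()
    return row[0] if row else 0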
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers, the partition key doesn't need to be random-looking. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (which is not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit when faced with concurrent operations: imagine that the given partition already has two records - and now two concurrent record-addition requests come in. Both can check that there are just two records, decide it's fine to add the third - and when both add their record, you end up with four records. You'll need to decide whether it is fine for you that a lucky user can get in 4 requests in a day, or whether it's a disaster. Note that theoretically you can get even more than 4: if the user manages to send N requests at exactly the same time, they may be able to get 2+N records in the database (but in the usual case, they won't manage to get many superfluous records). If you want 3 to be a hard limit, you'll probably need to change your solution - perhaps to one based on LWT, not on TTL.
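For completeness, here is a rough sketch of what a hard limit could look like with LWT - one possible modeling, an assumption rather than a recommendation from the answer above: keep one row per (user, day, slot) and claim a slot with INSERT ... IF NOT EXISTS, which serializes concurrent writers. Note this caps per calendar day, a simplification of the rolling 24h window; session is again a cassandra-driver Session.

import uuid
from datetime import date

# Hypothetical table for a hard cap of 3 orders per user per day:
#   CREATE TABLE order_slots (
#       hash_id  bigint,
#       user_id  text,
#       day      date,
#       slot     int,
#       order_id uuid,
#       PRIMARY KEY ((hash_id, user_id, day), slot)
#   );

def try_claim_slot(session, hash_id, user_id, order_id, max_orders=3):
    # Try each slot in turn; IF NOT EXISTS makes concurrent claims mutually exclusive.
    for slot in range(max_orders):
        result = session.execute(
            "INSERT INTO order_slots (hash_id, user_id, day, slot, order_id) "
            "VALUES (%s, %s, %s, %s, %s) IF NOT EXISTS",
            (hash_id, user_id, date.today(), slot, order_id),
        )
        if result.was_applied:
            return True     # the order fits within the limit
    return False            # all slots for today are already taken

ok = try_claim_slot(session, 42, "alice", uuid.uuid4())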
Third, I want to note that there is no important performance difference between COUNT and LIMIT when you know a priori that there will only be up to 3 (or, as explained above, 4 or some other similarly small number of) results. If you assume that the SELECT only yields three or fewer results, and can never yield a thousand, then it doesn't really matter whether you retrieve them or count them - just do whichever is more convenient for you. In any case, I think that paging is not a good solution for your need. For such short results you can just use the default page size and you'll never reach it anyway; moreover, paging hints to the server that you will likely continue reading the next page - so it caches the buffers it needs to do that - while in this case you know that you'll never continue past the first three results. So in short, don't use any special paging setup here: just use the default page size (which is 1 MB) and it will never be reached anyway.

AWS Keyspaces effective pagination

I have to jump to specific rows in an AWS Keyspaces table. Specifically, I am implementing pagination, so I want to be able to jump to a specific page. Do I have any options for that?
For example, I have to fetch 100 rows after the 1,000,000th row. And of course I want to do it as quickly as possible.
My ideas/solutions:
Set the page size to the requested one (100 in this case), iterate over the rows, and keep getting next_page until I come to the specific set.
Find the maximum possible page size and use it to iterate over the biggest possible sets of rows.
But maybe there are more clever solutions?
I don't have the option of changing the table by adding additional columns!
Pagination isn't a best practice in Cassandra because you don't know how many results you will have until you query for them. Amazon Keyspaces paginates results based on the number of rows that it reads to process a request, not the number of rows returned in the result set. As a result, some pages might contain fewer rows than you specify in PAGE SIZE for filtered queries. In addition, Amazon Keyspaces paginates results automatically after reading 1 MB of data to provide customers with consistent, single-digit millisecond read performance.
For paginating through pages there is a tool called the Export tool. This tool allows you to asynchronously read page by page. It will read partition keys even if you skip them - Keyspaces still has to read every partition key, which means using more RCUs, but this will accomplish your goal. When you are using the tool to read by page, you may see the tool stop after a certain number of pages; just restart the tool at the page it left off at when you run it again.
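For the first idea (walking pages until the target one), here is a minimal sketch with the Python cassandra-driver; the endpoint, keyspace, and table names are made up, and the SigV4/SSL setup Keyspaces requires is omitted. The key point is carrying paging_state from one request to the next - and, per the note above, a page may hold fewer rows than the page size:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Keyspaces connection details (TLS, SigV4 auth) are assumed to be configured elsewhere.
session = Cluster(["cassandra.us-east-1.amazonaws.com"], port=9142).connect("my_keyspace")

def fetch_nth_page(page_number, page_size=100):
    # Walk pages one by one until the requested page; earlier pages are still read (and billed).
    stmt = SimpleStatement("SELECT * FROM my_table", fetch_size=page_size)
    paging_state = None
    result = None
    for _ in range(page_number):
        result = session.execute(stmt, paging_state=paging_state)
        paging_state = result.paging_state
        if paging_state is None:    # ran out of data before reaching the target page
            break
    return list(result.current_rows) if result else []

rows = fetch_nth_page(10_001)   # roughly the 100 rows after the 1,000,000th row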

AWS DynamoDB count query results without retrieving

I would like to check how many entries are in a DynamoDB table that matches a query without retrieving the actual entries, using boto3.
I want to run a machine learning job on data from a DynamoDB table. The data I'm training on is the data that answers a query, not the entire table. I want to run the job only if I have enough data to train on.
Therefore, I want to check that I have enough entries that match the query.
It is worth mentioning that the DynamoDB table I'm querying is really big, so actually retrieving the data is not an option unless I actually want to run the job.
I know that I can use boto3.dynamodb.describe_table() to get how many entries there are in the entire table, but as I mentioned earlier, I want to know only how many entries match a query.
Any ideas?
This was asked and answered in the past, see How to get item count from DynamoDB?
Basically, you need to use the "Select" parameter to tell DynamoDB to only count the query's results, instead of retrieving them.
As usual in DynamoDB, this is truncated by paging: if the result set (not the count - the actual full results) is larger than 1 MB, then only the first 1 MB is retrieved, the items in it are counted, and you get back this partial count. If you're only interested in checking whether you have "enough" results, this may even be better for you - because you don't want to pay for reading a gigabyte of data just to check whether the data is there. You can even ask for a smaller page, to read less - depending on what you consider enough data.
Just remember that you'll pay Amazon not by the amount of data returned (just one integer, the count) but by the amount of data read from disk. Using such counts excessively may lead to surprisingly large costs.
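A sketch of that with boto3 (table and key names are made up): Select='COUNT' makes the query return only Count and ScannedCount, LastEvaluatedKey drives the 1 MB paging loop, and the loop can stop early once the threshold is reached:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my_table")   # hypothetical table name

def has_enough_items(key_condition, threshold):
    # Count items matching the query page by page, stopping as soon as the threshold is hit.
    total = 0
    kwargs = {"KeyConditionExpression": key_condition, "Select": "COUNT"}
    while True:
        resp = table.query(**kwargs)
        total += resp["Count"]                  # only the count comes back, no items
        if total >= threshold:
            return True                          # enough data to run the job
        if "LastEvaluatedKey" not in resp:
            return False                         # counted everything that matches
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

enough = has_enough_items(Key("user_id").eq("alice"), threshold=10_000)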

When does DataStax driver paging yield fewer rows than requested?

I am trying to use DataStax driver paging with a fetch size. However, the DataStax documentation says the following:
Note that setting a fetch size doesn’t mean that Cassandra will always return the exact number of rows, it is possible that it returns slightly more or less results
I don't really know the internal details of the paging implementation, but can someone please clarify in what situations we get more or fewer results from the server? For example, if I set the fetch size to 10, based on the above statement it's possible to get 8 or 12 rows as a result. I am trying to understand in what situations we would receive 8 (or 12) rows.
Note that setting a fetch size doesn’t mean that Cassandra will always return the exact number of rows, it is possible that it returns slightly more or less results
I'm not confident this statement is entirely true. You can expect that it's possible for a page to contain fewer rows than the desired page size. For example, if your page size is 10 and there are only 8 rows that match your query criteria, of course you will only get 8 rows back.
However, I'm not familiar with a case where the server will send back more rows than the page size in a single page result. The native protocol specification even specifies that the message returned will contain at most the page size:
If a positive value is provided for result_page_size, the result set of the RESULT message returned for the query will contain at most the result_page_size first rows of the query result.
Further, the protocol spec also states:
While the current implementation always respects the exact value of result_page_size, we reserve the right to return slightly smaller or bigger pages in the future for performance reasons.
I don't think that has been exercised, but might explain why the driver docs are phrased in this way.
Andy's answer is fairly complete, but I want to add a few more insights on why returning pages that are not exactly the desired size may be useful - in current or future implementations:
One reason why Cassandra may want to return short pages is filtering. Imagine that the request has ALLOW FILTERING, and needs to read a lot of data from disk just to produce a few rows that end up passing the filter and being returned to the client. The client, not aware of this, has asked for a page of 1000 rows - but in our example maybe actually generating 1000 rows that pass the filter would take 10 seconds, and the client would time out if Cassandra waited 10 seconds before producing any results. So in this case, Cassandra should just return whatever rows it managed to collect before timing out - even if these are just 17 rows and not 1000 rows. The client would receive these 17 rows and move on to the next page normally.
In the extreme case, there may be so much filtering work and so little output that a long time can pass without even a single row being produced. In this case, before timing out, Cassandra may return a page with zero results which has the has_more bit on, meaning the client should continue paging (the number of results being less than requested - or even zero - is not the sign of when to stop paging!). I'm not sure that Cassandra actually returns zero-row pages today, but Scylla (a faster Cassandra clone) definitely does, and drivers should remember to use the has_more bit as the only sign of when to stop paging.
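To make that concrete from the client side, here is a small sketch with the Python cassandra-driver (the Java driver behaves analogously; endpoint and table are made up). Note that the loop never compares the number of rows in a page to the fetch size - has_more_pages is the only stop condition:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

stmt = SimpleStatement("SELECT * FROM my_table", fetch_size=10)
result = session.execute(stmt)

while True:
    rows = result.current_rows       # may hold fewer than 10 rows - or even zero
    # ... process rows ...
    if not result.has_more_pages:    # the only reliable "no more data" signal
        break
    result = session.execute(stmt, paging_state=result.paging_state)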
The other question is why paging would return more rows than desired. As Andy said in his reply, I don't think this actually happens in Cassandra, nor in Scylla. But I can understand why some future implementation may want to allow it to happen: imagine that a coordinator needs 1000 rows for a page. It reads up to 1000 rows from each replica, but there's inconsistent data, and one replica has an extra row, so the coordinator now has 1001 rows to return. It can (and today, does) return only the first 1000 rows, but the downside is that some of the replicas are now in the wrong place in the data and will need to find their place again when asked to read the next page. Had we returned all 1001 rows we found, all of the replicas would be able to resume their reads efficiently from exactly where they left off.

Why is search performance slow for about 1M documents - how to scale the application?

I have created a search project based on Lucene 4.5.1.
There are about 1 million documents, each of them a few KB, and I index them with the fields docname (stored), lastmodified, and content. The overall size of the index folder is about 1.7 GB.
I used one document (the original one) as a sample and queried the content of that document against the index. The problem now is that each query comes back slowly. After some tests, I found that my queries are too large even though I removed stopwords, but I have no idea how to reduce the query string size. Plus, the smaller the query string is, the less accurate the results become.
This is not limited to a specific file; I also tested with other original files, and search performance is similarly slow (often 1-8 seconds).
Also, I have tried copying the entire index directory to a RAMDirectory while searching; that didn't help.
In addition, I have only one IndexSearcher shared across multiple threads, but in testing I only used one thread as a benchmark; the expected response time should be a few ms.
So, how can I improve search performance in this case?
Hint: I'm searching for the top 1000 hits.
If the number of fields is large, a nice solution is to not store them individually, but instead serialize the whole object into a single binary stored field.
The plus is that, when projecting the object back out after a query, you read a single field rather than many. getField(name) iterates over the entire field set (O(n/2) on average) before you get the value and set your object's field. With just one field, you read it once and deserialize.
Second, it might be worth looking at something like a MoreLikeThis query. See https://stackoverflow.com/a/7657757/277700

Resources