How to do Paging based on dynamic validation criteria? - pagination

Here is the case:
I need to display a list of records on a page, and paging is necessary. The question I got is whether a record should be displayed depends on validation result computed in memory, after selected from database.
for example, 50 records for one page:
Select 50 records from database
30 records are left after validation
Solution I have now it get all records from database, do validation and then get valid list of records. Paging is based on this list.
Is there any other good solution for this?

In the optimal case pagination can be divided in two steps. In the first step according to the selecting query a set of rows is selected from a database. All these rows could be displayed. Instead of fetching actual rows just retrieve a list of their identifiers. This list however large can be usually kept in memory. Second step is paging through the list by asking for n-th page of m items. Then only m rows are being completely retrieved from the database using their ids.
Additional step of computation is negating the idea of pagination that is having a list of identifiers of the whole result set.
What I can think of now without seeing the computation is to store results of computation in the database whenever a displaying row is being inserted/updated in the db. Since computation results depend from input parameters then for every row and for every range of input parameters you could have a different result.
This would then make paging possible. Performing the first step of paging should now include precomputed validation results and provide much faster retrieval of the list of row ids.

Related

Remote pagination and last_page: filter during, or after, database query?

I would like to use Tabulator's remote pagination to load records from my database table, one page at a time. I expect that I should be able to use the page and size parameters sent by Tabulator to my remote back-end to determine which records to select from the database.
For example, with page=2 and size=10, I can use MySQL's LIMIT 10,20 to select the records to be shown on page 2 (if size is set to 10).
However, doing this precludes me from using the count of all records to determine the number of pages in the table. Doing a count on the returned records will only yield 10 records, even if there are a total of 500 records (for example), so only one pagination button will be shown (instead of the expected 50 buttons).
So in order to do remote pagination "correctly" in Tabulator, it seems I must do a query to count all records from my database (with no limits), then do a count to determine the last_page, and then do something like PHP's array_slice to extract the nth page's worth of records to return as the dataset. Or I can do 2 database queries: count all records to determine # of pages, and then do a LIMIT [start],[end] query.
Is this correct?
Tabulator needs to know the last page number in order to layout the pagination buttons in the table footer so that users can select the page they want to view from the list of pages.
You simply need to do a query to count the total number of records and divide it by the number of page size which is passed in the request. you can run a count query quite efficiently returning only the count and no data.
You can then run a standard query with a limit set on the records to retreive the records for that page.
If you want to optimize things further you could stick the count value in cache so that you dont need to generate it on each request.

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, eg:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value in my page and retrieve the next query with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
But I'm not certain if it's good practice to have a unique partition key for a complex table.
It depends on requirement and Data Model how you should choose your partition key. If you have one key as partition key it has to be unique otherwise data will be upsert (overridden with new data). If you have wide row (a clustering key), then make your partition key unique (a key that appears once in a table) will not serve the purpose of wide row. In CQL “wide rows” just means that there can be more than one row per partition. But here there will be one row per partition. It would be better if you can provide the schema.
Please follow below link about pagination of Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+.
Cassandra 2.0 has auto paging. Instead of using token function to
create paging, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use pagingState object that represents where you are in the result set when the last page was fetched.
EDITED:
Please check the below link:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem. Maybe adding this here quickly.
First there is a table with two fields. Just for illustration we use only few fields.
1.Say we insert a million rows with this
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI. Assuming that there are hundred entries 10 pages each.
For this we update the table with a column called page_no.
Create a secondary index for this column.
Then do a one time update for this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no as value 10.
Since we can query on a secondary index each page can be queried independently.
Code is self explanatory and here - https://github.com/alexcpn/testgo
Note caution on how to use secondary index properly abound. Please check it. In this use case I am hoping that i am using it properly. Have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

How to invert a merge query in power query

I have a single column table of customer account numbers and a main table containing 400,000 records pulling from an access database. I want to remove all records from the table where the customer account number can be found in the single column table.
The merge query capability in power query allows me to return only the records where there is a match on the customer list (in addition to a variety of other variations on this theme) but I would like to know whether there is a way to invert this so that I return all records where the customer number does not appear in this list.
I have achieved this already by using the List.Contains function and adding a custom column to identify the rows to exclude and then filtering them out, but I think this is severely impacting the performance of my workbook. Refreshing the table that initially has 400,000 rows prior to this series of transformations takes a very long time, and all queries that depend on this table then also take a long time to refresh.
Thank you
If you do a Left Anti Join of your table with a single column, this will give you your table filtered to only have the rows which do not match to the single column.

MutliGet or multiple Get operations when paging

I have a wide column family used as a 'timeline' index, where column names are timestamps. In order to prevent hotspots, I shard the CF by month so that each month has its own row in the CF.
I query the CF for a slice range between two dates and limit the number of columns returned based on the page's records per page, say to 10.
The problem is that if my date range spans several months, I get 10 columns returned from each row, even if there is 10 matching columns in the first row - thus satisfying my paging requirement.
I can see the logic in this, but it strikes me as a real inefficiency if I have to retrieve redundant records from potentially multiple nodes when I only need the first 10 matching columns regardless of how many rows they span.
So my question is, am I better off to do a single Get operation on the first row and then do another Get operation on the second row if my first call doesnt return 10 records and continue until I have the required no. of records (or hit the row limit), or just accept the redundancy and dump the unneeded records?
I would sample your queries and record how many rows you needed to fetch for each one in order to get your 10 results and build a histogram of those numbers. Then, based on the histogram, figure out how many rows you would need to fetch at once in order to complete, say, 90% of your lookups with only a single query to Cassandra. That's a good start, at least.
If you almost always need to fetch more than one row, consider splitting your timeline by larger chunks than a month. Or, if you want to take a more flexible approach, use different bucket sizes based on the traffic for each individual timeline: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra (see the "Variable Time Bucket Sizes" section).

How to iterate over a SOLR shard which has over 100 million documents?

I would like to iterate over all these documents without having to load the entire result in memory which seems to be the case apparently - QueryResponse.getResults() returns SolrDocumentList which is an ArrayList.
Can't find anything in the documentation. Am using SOLR 4.
Note on the background of problem: I need to do this when adding a new SOLR shard to the existing shard cluster. In that case, I would like to move some documents from the existing shards to the newly added shard(s) based on consistent hashing. Our data grows constantly and we need to keep introducing new shards.
You can set the 'rows' and 'start' query params to paginate a result set. Query first with start = 0, then start = rows, start = 2*rows, etc. until you reach the end of the complete result set.
http://wiki.apache.org/solr/CommonQueryParameters#start
I have a possible solution I'm testing:
Solr paging 100 Million Document result set
pasted:
I am trying to do deep paging of very large result sets (e.g., over 100 million documents) using a separate indexed field (integer) into which I insert a random variable (between 0 and some known MAXINT). When querying large result sets, I do the initial field query with no rows returned and then based on the count, I divide the range 0 to MAXINT in order to get on average PAGE_COUNT results by doing the query again across a sub-range of the random variable and grabbing all the rows in that range. Obviously the actual number of rows will vary but it should follow a predictable distribution.

Resources