DocumentDB performance when using pagination - Azure

I have working pagination code that works great with Azure Search and SQL, but when I use it with DocumentDB it takes up to 60 seconds to load.
We believe it's a latency issue, but I can't find a workaround to speed it up.
Any documentation or ideas on where to start looking?
public PagedList(IQueryable<T> superset, int pageNumber, int pageSize, string sortExpression = null)
{
    if (pageNumber < 1)
        throw new ArgumentOutOfRangeException("pageNumber", pageNumber, "PageNumber cannot be below 1.");
    if (pageSize < 1)
        throw new ArgumentOutOfRangeException("pageSize", pageSize, "PageSize cannot be less than 1.");

    // set source to blank list if superset is null to prevent exceptions
    TotalItemCount = superset == null ? 0 : superset.Count();

    if (superset != null && TotalItemCount > 0)
    {
        Subset.AddRange(pageNumber == 1
            ? superset.Skip(0).Take(pageSize).ToList()
            : superset.Skip((pageNumber - 1) * pageSize).Take(pageSize).ToList()
        );
    }
}

While the LINQ provider for DocumentDB translates .Take() into a "TOP" SQL clause under certain circumstances, DocumentDB has no equivalent for Skip. So I'm a little surprised it works at all, but I suspect that the provider is rerunning the query from scratch to simulate Skip. In the comments here is a discussion led by a DocumentDB product manager on why they chose not to implement SKIP. tl;dr: it doesn't scale for NoSQL databases. I can confirm this with MongoDB (which does have skip functionality): later pages simply scan and throw away earlier documents, so the later in the list you go, the slower it gets. I suspect that the LINQ implementation is doing something similar, except client-side.
DocumentDB does have a mechanism for getting documents in chunks, but it works a bit differently from SKIP: it uses a continuation token. You can even set a maxPageSize; however, there is no guarantee that you'll get that number of documents back.
I recommend that you implement a client-side cache of your own and use a fairly large maxPageSize. Let's say each page in your UI is 10 rows and your cache currently has 27 rows in it. If the user selects page 1 or page 2, you have enough rows to render the result from the data already cached. If the user selects page 7, then you know that you need at least 70 rows in your cache. Use the last continuation token to fetch more until you have at least 70 rows, and then render rows 61-70. On the plus side, continuation tokens are long lived, so you can store them and use them later based upon user input.
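To make that concrete, here is a minimal sketch of the continuation-token pattern with the DocumentDB .NET SDK (Microsoft.Azure.Documents.Client). The client, collection URI, Article type and cache list are assumptions made for the example; the relevant part is FeedOptions.MaxItemCount / RequestContinuation together with ExecuteNextAsync:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

public class CachedPager
{
    private readonly DocumentClient _client;
    private readonly Uri _collectionUri;
    private readonly List<Article> _cache = new List<Article>();  // Article is a placeholder type
    private string _continuationToken;                            // null until the first fetch

    public CachedPager(DocumentClient client, Uri collectionUri)
    {
        _client = client;
        _collectionUri = collectionUri;
    }

    public async Task<List<Article>> GetPageAsync(int pageNumber, int pageSize)
    {
        int rowsNeeded = pageNumber * pageSize;

        // Keep pulling server pages into the cache until we can satisfy the requested UI page.
        while (_cache.Count < rowsNeeded)
        {
            var options = new FeedOptions
            {
                MaxItemCount = 100,                       // large server page; fewer may come back
                RequestContinuation = _continuationToken  // resume where the last fetch stopped
            };

            var query = _client.CreateDocumentQuery<Article>(_collectionUri, options)
                               .AsDocumentQuery();

            FeedResponse<Article> page = await query.ExecuteNextAsync<Article>();
            _cache.AddRange(page);
            _continuationToken = page.ResponseContinuation;

            if (string.IsNullOrEmpty(_continuationToken))
                break;  // no more documents on the server
        }

        // Slice the requested page out of the local cache (may be short on the last page).
        return _cache.Skip((pageNumber - 1) * pageSize).Take(pageSize).ToList();
    }
}

Because the tokens are long lived, _continuationToken (and the cache) can also be persisted per user session and reused on later requests.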

Related

How do I create a composite index for my Firestore query?

I'm trying to perform a Firestore query on a collection, and it fails because an index needs to be created for the query I'm attempting. The error contains a link that is supposed to auto-create the missing index for me. However, when I follow the link and attempt to create the index that has been prepared for me, I encounter an error stating "name only indexes are not supported". I would also point out that I have been using the npm functions-framework to test my cloud function that contains the relevant query.
I have tried creating the composite index myself manually, but none of the indexes I have made seem to satisfy my attempted query.
Sample docs in my Items Collection:
{
    descriptionLastModified: someTimestamp <a timestamp datatype>
    detectedLanguage: "en-us" <string>
}
{
    descriptionLastModified: someTimestamp <a timestamp datatype>
    detectedLanguage: "en-us" <string>
}
{
    descriptionLastModified: someTimestamp <a timestamp datatype>
    detectedLanguage: "fr" <string>
}
{
    descriptionLastModified: someTimestamp <a timestamp datatype>
    detectedLanguage: "en-us" <string>
}
These are all queries I have tried which fail:
let queryRef = itemsRef.where('descriptionLastModified','<=', oneDayAgoTimestamp).orderBy("descriptionLastModified","desc").where("detectedLanguage", '==', "en-us").get()
let queryRef = itemsRef.where('descriptionLastModified','<=', oneDayAgoTimestamp).where("detectedLanguage", '==', "en-us").get()
let queryRef = itemsRef.where("detectedLanguage", '==', "en-us").where('descriptionLastModified','<=', oneDayAgoTimestamp).get()
I have made the following composite indexes at the collection level to no avail:
CollectionId: items  Fields: descriptionLastModified DESC, detectedLangauge ASC
CollectionId: items  Fields: descriptionLastModified ASC, detectedLangauge ASC
CollectionId: items  Fields: detectedLangauge ASC, descriptionLastModified DESC
My expectation is that I should be able to filter my items by their descriptionLastModified timestamp field and additionally by the value of their detectedLanguage string field.
In case anyone finds this in the future: it's 2021 and I still find that manually created composite indexes, despite being incredibly simple (or so you'd think; I fully understand why the OP thought his indexes would work), often just don't work. Doubtless there is some subtlety that reading some guides would make clear, but I haven't found the trick yet, and I have been using Firestore intensively at work for over 18 months.
The trick is supposed to be to use the link the error creates, but this often fails: you get a dialogue box telling you an index will be created, but no details that would let you create it manually, and the friendly blue 'create' button does nothing; it neither creates the index nor dismisses the window.
For a while I had it working in Firefox, but that stopped. A colleague a couple of desks over who has to create them a lot tells me that Edge is the most reliable, and that you have to be very careful not to have multiple Google accounts signed in. If Edge (or Chrome) takes you to the wrong login when following the link (it will assume your default login rather than, say, the one currently selected in your only Google Cloud console window), then even if you switch users back it only works about one time in three. He tells me that in Edge it works about 60% of the time.
I used to get about a 30% success rate in Firefox just by hitting refresh and so on a few times, but I can't get it working in anything other than Edge now. In practice, unless there is a client with little cash who will notice, I just go for inefficient and costly queries that return a superset of the results and do some filtering on the results. Mostly this runs in Node.js and it's nippy enough for my purposes. It's a real shame to ramp up the read counts and the consequential bills, but there just doesn't seem to be a fix.
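For what it's worth, when the console link misbehaves you can also declare the index in a firestore.indexes.json file and push it with the Firebase CLI (firebase deploy --only firestore:indexes), bypassing the console entirely. Below is a sketch of what I'd expect the index for the query in the question to look like, an equality filter on detectedLanguage plus a descending sort on descriptionLastModified; treat the exact field order and directions as an assumption to verify against your own query.

{
  "indexes": [
    {
      "collectionGroup": "items",
      "queryScope": "COLLECTION",
      "fields": [
        { "fieldPath": "detectedLanguage", "order": "ASCENDING" },
        { "fieldPath": "descriptionLastModified", "order": "DESCENDING" }
      ]
    }
  ]
}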

Can I efficiently query generic fields without resorting to HQL?

I find myself doing a lot of queries to fetch just the first couple of items of a big set, e.g. to show the three most recent news articles or blog posts on the homepage of a website.
As long as this query only involves predefined or custom Parts, I can do something like this:
public IEnumerable<ContentItem> GetTopArticles(int amount)
{
    var cultureRecord = _cultureManager.GetCultureByName(_orchardServices.WorkContext.CurrentCulture);
    var articles = _orchardServices.ContentManager.Query().ForType("Article")
        .Where<LocalizationPartRecord>(lpr => lpr.CultureId == cultureRecord.Id)
        .OrderBy<CommonPartRecord>(cpr => cpr.PublishedUtc)
        .Slice(0, amount);
    return articles;
}
I'm assuming this will more or less be the same as a SELECT TOP [amount] ... in SQL and will have good performance on a large number of records.
However, sometimes I use Migrations or Import to create Content Types from an external source and want to conditionally check a field from the generic Part. In this case I don't have a Part or PartRecord class that I can pass as a parameter to the ContentQuery methods and if I want to do a conditional check on any of the fields I currently do something like this:
public IEnumerable<ContentItem> GetTopArticles(int amount)
{
    var articles = _orchardServices.ContentManager.Query().ForType("Article")
        .OrderBy<CommonPartRecord>(cpr => cpr.PublishedUtc)
        .List()
        .Where(a => a.Content.Article.IsFeatured.Value == true)
        .Take(amount);
    return articles;
}
This is really wasteful and causes large overhead on big sets but I really, REALLY, do not want to delve into the database to figure out Orchard's inner workings and construct long and complex HQL queries every time I want to do something like this.
Is there any way to rewrite the second query with IContentQuery methods without incurring a large performance hit?
I'm working on something similar (being able to query model data with a dynamic name). Sadly, I haven't found anything that makes it easy.
The method I've found that works is to do plain SQL queries against the database. Check out this module for syntax on that if you do later find yourself willing to delve into the database.
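In case it helps, here is a rough sketch of what issuing plain SQL from an Orchard 1.x module can look like, going through Orchard's ISessionLocator and NHibernate's CreateSQLQuery. This is an assumption-heavy illustration, not the linked module's own API: the table and column names below are placeholders that depend on your table prefix and enabled features, so check your actual database before using anything like it.

using System.Collections.Generic;
using System.Linq;
using NHibernate;
using Orchard.ContentManagement.Records;
using Orchard.Data;

public class RawSqlTopArticles
{
    private readonly ISessionLocator _sessionLocator;

    public RawSqlTopArticles(ISessionLocator sessionLocator)
    {
        _sessionLocator = sessionLocator;
    }

    // Returns the ids of the most recently published items, newest first.
    // "Common_CommonPartVersionRecord" and "PublishedUtc" are placeholder
    // names -- verify them against your own schema.
    public IList<int> GetTopPublishedIds(int amount)
    {
        ISession session = _sessionLocator.For(typeof(ContentItemRecord));
        return session
            .CreateSQLQuery("SELECT Id FROM Common_CommonPartVersionRecord ORDER BY PublishedUtc DESC")
            .SetMaxResults(amount)    // translated to TOP / LIMIT by the dialect
            .List()
            .Cast<int>()
            .ToList();
    }
}

The resulting ids can then be handed to ContentManager.GetMany to load the actual content items.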

Referencing external doc in CouchDB view

I am scraping a 90K-record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records with a difference between them.
However, I cannot quite figure out how to pull another doc into the view; everything I have read only discusses referencing external docs via the emit() function, which is too late to let me compare them. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
    if (doc._id.slice(0, 1) !== '$' && doc._id.slice(0, 1) !== '_') {
        // lookup() is hypothetical -- it stands in for "fetch the prefixed twin doc".
        var otherDoc = lookup('$test' + doc._id);
        if (otherDoc) {
            var keys = Object.keys(doc);
            var same = true;
            keys.forEach(function(key) {
                if (key.slice(0, 1) !== '_' && key.slice(0, 1) !== '$' && key !== 'expires') {
                    // Object.equal is a stand-in for a deep-equality check.
                    if (!Object.equal(otherDoc[key], doc[key])) {
                        same = false;
                    }
                }
            });
            if (!same) {
                emit(doc._id, 1);
            }
        }
    }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent; otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities:
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
Personally I like the last option best, but only if you don't plan to use the database as-is in production; i.e., you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.

Querying Azure table storage for null values

Does anyone know the proper way to query Azure Table storage for a null value? From what I've read, it's possible (although there is a bug which prevents it on development storage). However, I keep getting the following error when I do so against live cloud storage:
One of the request inputs is not valid.
This is a dumbed down version of the LINQ query that I've put together.
var query = from fooBar in fooBarSVC.CreateQuery<FooBar>("FooBars")
            where fooBar.PartitionKey == kPartitionID
                && fooBar.Code == kfooBarCode
                && fooBar.Effective_Date <= kFooBarDate.ToUniversalTime()
                && (fooBar.Termination_Date > kFooBarDate.ToUniversalTime() || fooBar.Termination_Date == null)
            select fooBar;
If I run the query without checking for null, it works fine. I know a possible solution would be to run a second query on the collection that this query brings back. I don't mind doing that if I need to, but would like to know if I can get this approach to work first.
Anyone see anything obvious I'm doing wrong?
The problem is that because Azure Table storage does not have a schema, the null column actually doesn't exist; that is why your query is not valid. There is no such thing as a null column in Table storage. You could store an empty string if you really have to.
Really, though, the fundamental issue here is that Azure Table storage is not built to be queried by any columns other than PartitionKey and RowKey. Every time you query on one of these non-key columns you are doing a table scan, and if you start to get lots of data you are going to see a very high rate of query timeouts. I would suggest setting up a manual index for these types of queries: for example, you could store the same data in the same table but with different values for the RowKey. Ultimately, if your app is not getting crazy high usage, I would just use SQL Azure, as it will be much more flexible for the types of queries you are doing.
Update: Azure has a great guide on table storage design that I would recommend reading. http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
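To illustrate the manual-index suggestion above, here is a rough sketch using the Azure.Data.Tables client (the same library that shows up further down in this thread). The entity shape, key prefixes and variable values are made up for the example; the point is simply that the same logical record is written under two row keys in one partition so that both lookups stay fast:

using System;
using Azure.Data.Tables;

string connectionString = "<storage connection string>";
string partitionKey = "FOO";
string fooBarId = "12345";
string code = "ABC";
DateTimeOffset effectiveDate = DateTimeOffset.UtcNow;

var table = new TableClient(connectionString, "FooBars");

// Row keyed by the natural id, for direct lookups.
var byId = new TableEntity(partitionKey, "id_" + fooBarId)
{
    ["Code"] = code,
    ["Effective_Date"] = effectiveDate
};

// Duplicate row keyed by the value we want to query on (the "manual index").
var byCode = new TableEntity(partitionKey, "code_" + code + "_" + fooBarId)
{
    ["Code"] = code,
    ["Effective_Date"] = effectiveDate
};

// Both rows share a partition key, so they can be written atomically in one batch.
table.SubmitTransaction(new[]
{
    new TableTransactionAction(TableTransactionActionType.Add, byId),
    new TableTransactionAction(TableTransactionActionType.Add, byCode)
});

Reads by Code then become an efficient RowKey range query on the "code_" prefix instead of a full table scan.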
I just had this problem and found a nice little ninja-trick to actually test for nulls. Although I'm using the Azure Storage interface directly, I'm 90% sure it will work for LINQ too if you do the same.
Here's what I did to check if Price (Int32?) is null:
not (Price lt 0 or Price gt 0)
I'm guessing that in your case you can do the same thing in LINQ by testing whether fooBar.Termination_Date is less than or greater than DateTime.UtcNow, for example. Something like this:
where fooBar.PartitionKey == kPartitionID
    && fooBar.Code == kfooBarCode
    && fooBar.Effective_Date <= kFooBarDate.ToUniversalTime()
    && (fooBar.Termination_Date > kFooBarDate.ToUniversalTime()
        || !(fooBar.Termination_Date < DateTime.UtcNow
             || fooBar.Termination_Date > DateTime.UtcNow))
select fooBar;
For a string column called MyColumn I was able to type: not(MyColumn gt '')
Mike S's answer above put me on the right path.
For strings, we can compare to empty string.
IsNotBlank(value)
Can be:
(Value gt '')
Using the Azure Tables client library for .NET to query for null Guid values:
In the sample code, the property's name is MyColumn.
var filter = Azure.Data.Tables.TableClient
    .CreateQueryFilter($"not(MyColumn gt {Guid.Empty})");
The TableClient.CreateQueryFilter method will create the filter:
not(MyColumn gt guid'00000000-0000-0000-0000-000000000000')

Creating a pagination index in CouchDB?

I'm trying to create a pagination index view in CouchDB that lists the doc._id for every Nth document found.
I wrote the following map function, but the pageIndex variable doesn't reliably start at 1 - in fact it seems to change arbitrarily depending on the emitted value or the index length (e.g. 50, 55, 10, 25 - all start with a different file, though I seem to get the correct number of files emitted).
function(doc) {
    if (doc.type == 'log') {
        if (!pageIndex || pageIndex > 50) {
            pageIndex = 1;
            emit(doc.timestamp, null);
        }
        pageIndex++;
    }
}
What am I doing wrong here? How would a CouchDB expert build this view?
Note that I don't want to use the "startkey + count + 1" method that's been mentioned elsewhere, since I'd like to be able to jump to a particular page or the last page (user expectations and all), I'd like to have a friendly "?page=5" URI instead of "?startkey=348ca1829328edefe3c5b38b3a1f36d1e988084b", and I'd rather CouchDB did this work instead of bulking up my application, if I can help it.
Thanks!
View functions (map and reduce) are purely functional. Side-effects such as setting a global variable are not supported. (When you move your application to BigCouch, how could multiple independent servers with arbitrary subsets of the data know what pageIndex is?)
Therefore the answer will have to involve a traditional map function, perhaps keyed by timestamp.
function(doc) {
    if (doc.type == 'log') {
        emit(doc.timestamp, null);
    }
}
How can you get every 50th document? The simplest way is to add a skip=0, skip=50, or skip=100 parameter. However, that is not ideal (see below).
A way to pre-fetch the exact IDs of every 50th document is a _list function which only outputs every 50th row. (In practice you could use Mustache.JS or another template library to build HTML.)
function() {
    var ddoc = this,
        pageIndex = 0,
        emitted = 0,
        row;

    send("[");
    while (row = getRow()) {
        if (pageIndex % 50 == 0) {
            // Output a comma between entries so the result is valid JSON.
            if (emitted > 0) {
                send(",");
            }
            send(JSON.stringify(row));
            emitted += 1;
        }
        pageIndex += 1;
    }
    send("]");
}
This will work for many situations; however, it is not perfect. Here are some considerations I have in mind -- not necessarily showstoppers, but it depends on your specific situation.
There is a reason the pretty URLs are discouraged. What does it mean if I load page 1, then a bunch of documents are inserted within the first 50, and then I click to page 2? If the data is changing a lot, there is no perfect user experience; the user must somehow feel the data changing.
The skip parameter and example _list function have the same problem: they do not scale. With skip you are still touching every row in the view starting from the beginning: finding it in the database file, reading it from disk, and then ignoring it, over and over, row by row, until you hit the skip value. For small values that's quite convenient but since you are grouping pages into sets of 50, I have to imagine that you will have thousands or more rows. That could make page views slow as the database is spinning its wheels most of the time.
The _list example has a similar problem; however, you front-load all the work, running through the entire view from start to finish and (presumably) sending the relevant document IDs to the client so it can quickly jump around the pages. But with hundreds of thousands of documents (you call them "log" so I assume you will have a ton), that will be an extremely slow query which is not cached.
In summary, for small data sets, you can get away with the page=1, page=2 form however you will bump into problems as your data set gets big. With the release of BigCouch, CouchDB is even better for log storage and analysis so (if that is what you are doing) you will definitely want to consider how high to scale.
