Cosmos DB Date Index Not Efficient

I have a collection with a Date field that is populated by a C# application using a DateTime object. This field is serialized to the following format "2018-06-10T17:32:48.3285735Z".
I haven't touched the Index Policy in the collection, so strings are using the Range index type. From what I've read in the documentation, that's the most efficient way to index dates; however, when I use the Date field in an ORDER BY clause, the query consumes at least 10x more RUs than if I were to query using the timestamp (_ts) number field. That means paying 10x more for this single collection.
To illustrate the issue:
SELECT TOP 100 * FROM c ORDER BY c.Date DESC
//query consumes a minimum of 500 RUs
SELECT TOP 100 * FROM c ORDER BY c._ts DESC
//query consumes 50 RUs
Is this how it is supposed to work or am I missing something? I suspect that if this was the expected behavior, it would be emphasized in the index documentation, and storing dates as numbers would be highlighted as the best practice.
EDIT:
This is the index policy for the collection (I never changed it).
{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {
                    "kind": "Range",
                    "dataType": "Number",
                    "precision": -1
                },
                {
                    "kind": "Range",
                    "dataType": "String",
                    "precision": -1
                },
                {
                    "kind": "Spatial",
                    "dataType": "Point"
                }
            ]
        }
    ],
    "excludedPaths": []
}

This may have to do with index collisions (multiple values map to the same index term).
You may want to narrow the range of the field Date and see if that helps. Basically, try this query:
SELECT TOP 100 * FROM c WHERE (c.Date BETWEEN '2000-01-01' AND '2100-01-01') ORDER BY c.Date DESC
Please note that the added filter should not change the query result set.

Did you try specifically configuring for Range Queries?
I think by default strings are hashed and you have to specify indexing for range queries.
I found this in the documentation:
By default, Azure Cosmos DB indexes all string properties within documents consistently with a Hash index.
Documentation link
For setting up a range query index on the collection:
DocumentCollection collection = new DocumentCollection { Id = "orders" };
collection.IndexingPolicy = new IndexingPolicy(
    new RangeIndex(DataType.String) { Precision = -1 });
await client.CreateDocumentCollectionAsync("/dbs/orderdb", collection);
The document they are querying against looks like this:
{
    "id": "09152014101",
    "OrderDate": "2014-09-15T23:14:25.7251173Z",
    "ShipDate": "2014-09-30T23:14:25.7251173Z",
    "Total": 113.39
}
Documentation link

I believe this is an optimisation deficiency when the query uses TOP and ORDER BY. I've found that whilst there is not much difference in RU between a range query using the timestamp as a number and one using it as a string, in scenarios such as yours the range index on the string appears to be ignored.
User Voice issue here:
https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/32345410-optimise-top-with-order-by-clause-queries
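For anyone wanting to reproduce the RU comparison, here is a minimal sketch using the @azure/cosmos JavaScript SDK (the account, database, and container names are placeholders); the requestCharge on each page of results reports the RUs consumed:

import { CosmosClient } from "@azure/cosmos";

// Placeholder connection details; substitute your own account.
const client = new CosmosClient({ endpoint: "https://<account>.documents.azure.com", key: "<key>" });
const container = client.database("mydb").container("mycoll");

async function measureRu(query: string): Promise<void> {
    // Fetch one page of up to 100 items and report its RU charge.
    const page = await container.items.query(query, { maxItemCount: 100 }).fetchNext();
    console.log(`${query} -> ${page.requestCharge} RUs, ${page.resources.length} items`);
}

(async () => {
    await measureRu("SELECT TOP 100 * FROM c ORDER BY c.Date DESC");
    await measureRu("SELECT TOP 100 * FROM c ORDER BY c._ts DESC");
})();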

Related

How to filter objects inside arrays in DynamoDB

I'm implementing query filters in my nodejs application.
In modeling, I have this schema:
"clause": [
{
"description": "test1",
"number": 200
},
{
"description": "test2",
"number": 201
},
{
"description": "test3",
"number": 202
},
],
Basically I need to pass an array of objects to DynamoDB and find out which records contain the information I searched for.
I've had success filtering just one object within the array, like this:
import { QueryCommandInput } from '@aws-sdk/lib-dynamodb';

const params: QueryCommandInput = {
    TableName: config.CONTRACT_DB,
    KeyConditionExpression: 'pk = :i',
    FilterExpression: 'contains(#clause, :clause)',
    ExpressionAttributeNames: {
        '#clause': 'clause',
    },
    ExpressionAttributeValues: {
        ':i': `user#${user.id}`,
        ':clause': {
            number: 200,
            description: 'test1',
        },
    },
};
But this requires me to know the values of both number and description; I could not get a result by supplying only one of the properties.
And I have no idea how I would implement a solution where the user enters multiple clauses.
Has anyone had success querying objects inside arrays in DynamoDB? I didn't find anything relevant here.
You cannot search by description with the schema you designed. To search for a value, that value must be in the partition key (pk) or sort key (sk). You need to change the schema. I suggest splitting your array so that each item in the array is an item in the database. Then you can set the pk to a pointer to the parent object, and set sk to a number if you want to preserve the order of items in the array.
Next, create a GSI with the description as the pk, and the sk is your choice depending on your needs. Then you can search for an exact match with description by querying the GSI. If you want a partial search (begins_with), you can set the pk to a fixed value such as "clauseDescription", and the sk to description.
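As a rough sketch of that layout (the table name, index name, and attribute values here are hypothetical), each clause becomes its own item, and a GSI keyed on description makes it directly queryable:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand, QueryCommand } from '@aws-sdk/lib-dynamodb';

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function demo(): Promise<void> {
    // One item per clause: pk points at the parent record, sk preserves array order.
    await doc.send(new PutCommand({
        TableName: 'contracts', // hypothetical table
        Item: { pk: 'user#123', sk: 0, description: 'test1', number: 200 },
    }));

    // Query a GSI (here called 'description-index') with description as its partition key.
    const res = await doc.send(new QueryCommand({
        TableName: 'contracts',
        IndexName: 'description-index',
        KeyConditionExpression: '#d = :d',
        ExpressionAttributeNames: { '#d': 'description' },
        ExpressionAttributeValues: { ':d': 'test1' },
    }));
    console.log(res.Items);
}

demo();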

CosmosDB Paging doesn't Return Correct Page Size

Here's my data model from CosmosDB:
{
    "id": "100",
    "BookID": "100",
    "PublishDate": "2014-02-23",
    "Authors": [
        {
            "FirstName": "Jerry",
            "Title": "Writer"
        },
        {
            "FirstName": "Sally",
            "Title": "CEO"
        },
        {
            "FirstName": "Tom",
            "Title": "COO"
        }
    ]
}
I know we can do paging on the Book object level. For example, I am able to do a query on SELECT * FROM c and set page number and page size.
However, am I able to do paging on the sub-object level? In this case, on the Authors level?
I am asking this question because I used the exact same code for both the Book-related query and the Authors-related query. The Book query returns the correct result in terms of page number and page size, but the Authors query always returns all the items in the array. The query for Authors is:
SELECT c.Authors FROM c WHERE c.BookID = "100"
The result is incorrect with the page size = 1, and page number = 1. It ends up returning all 3 of the authors.
So I was thinking maybe Cosmos DB treats Book as the object, and paging only works at the Book level? Is that why paging at the Authors level is not working?
I think there's a bit of misunderstanding around paging: paging relates to returning documents, not parts of a document.
If you ask for array elements, that's what you'll get. You'll get the full array, not a subset. Now, if you have, say, 100 documents, each with the same BookID=100, then paging will affect how many of those 100 documents are returned.
If you wanted to query the Authors in this case, you could do the following:
SELECT c.BookID, a.FirstName, a.Title FROM c JOIN a IN c.Authors
This is detailed in the Cosmos DB docs here: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-join
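To make the document-level paging concrete, here is a hedged sketch with the @azure/cosmos JavaScript SDK (connection details and names are placeholders): maxItemCount sets the page size, and the continuation token fetches the next page of documents, never parts of a document:

import { CosmosClient } from "@azure/cosmos";

const container = new CosmosClient({ endpoint: "https://<account>.documents.azure.com", key: "<key>" })
    .database("mydb")
    .container("books");

async function firstTwoPages(): Promise<void> {
    const query = "SELECT * FROM c";
    // Page 1: up to 10 whole documents.
    const page1 = await container.items.query(query, { maxItemCount: 10 }).fetchNext();
    // Page 2: resume from page 1's continuation token.
    const page2 = await container.items
        .query(query, { maxItemCount: 10, continuationToken: page1.continuationToken })
        .fetchNext();
    console.log(page1.resources.length, page2.resources.length);
}

firstTwoPages();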

SELECT single field from embedded array in azure documentDB

I have a documentDB collection that looks like this sample:
{
    "data1": "hello",
    "data2": [
        {
            "key": "key1",
            "value": "value1"
        },
        {
            "key": "key2",
            "value": "value2"
        }
    ]
}
In reality the data has a lot of other fields, and the embedded array has some fields where the data is quite large. I need to query the data, and I care about the small "key" field in the data2 array, but I do not need the large "value". I am finding that returning all the value data causes performance problems, but if I exclude the array data from the SELECT altogether it is fast (so the data size is the issue).
I cannot figure out a way to return only the "key" but exclude the "value" in the embedded array.
I basically want SELECT r.data1, r.data2.key and to have it return as:
{
    "data1": "hello",
    "data2": [
        {
            "key": "key1"
        },
        {
            "key": "key2"
        }
    ]
}
but it doesn't seem possible to SELECT r.data2.key because it is in an array.
A JOIN will cause it to return a copy of each document for each "data2" array element, which does not work for me. My only other option would be to migrate the data and put the data I want into its own array so I can select the whole object.
Is this possible some how that I have not been able to figure out?
Mike,
As you have surmised, this is not possible without a custom UDF until DocumentDB supports sub-queries. If you would like to go down that route, see the following answer for an example of how the UDF may have to look:
DocumentDB Sub Query
Good luck!
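
For reference, a UDF along those lines might look like the following sketch (the UDF name keysOnly is made up, and registration here uses the current @azure/cosmos SDK rather than the original DocumentDB client):

import { CosmosClient } from "@azure/cosmos";

const container = new CosmosClient({ endpoint: "https://<account>.documents.azure.com", key: "<key>" })
    .database("mydb")
    .container("mycoll");

async function run(): Promise<void> {
    // Register a UDF that strips everything but "key" from each array element.
    await container.scripts.userDefinedFunctions.create({
        id: "keysOnly", // hypothetical name
        body: `function keysOnly(arr) {
            return (arr || []).map(function (item) { return { key: item.key }; });
        }`,
    });

    // Call it in the projection to avoid returning the large "value" fields.
    const { resources } = await container.items
        .query("SELECT r.data1, udf.keysOnly(r.data2) AS data2 FROM r")
        .fetchAll();
    console.log(resources);
}

run();

Note that UDFs execute per document and carry their own RU cost, so it is worth measuring whether this actually beats returning the full array.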

Mongoose/Mongodb previous and next in embedded document

I'm learning Mongodb/Mongoose/Express and have come across a fairly complex query (relative to my current level of understanding anyway) that I'm not sure how best to approach. I have a collection - to keep it simple let's call it entities - with an embedded actions array:
name: String,
actions: [{
    name: String,
    date: Date
}]
What I'd like to do is to return an array of documents with each containing the most recent action (or most recent to a specified date), and the next action (based on the same date).
Would this be possible with one find() query, or would I need to break this down into multiple queries and merge the results to generate one result array? I'm looking for the most efficient route possible.
Provided that your "actions" are inserted with the "most recent" being the last entry in the list, and usually this will be the case unless you are specifically updating items and changing dates, then all you really want to do is "project" the last item of the array. This is what the $slice projection operation is for:
Model.find({},{ "actions": { "$slice": -1 } },function(err,docs) {
// contains an array with the last item
});
If indeed you are "updating" array items and changing dates, but you want to query for the most recent on a regular basis, then you are probably best off keeping the array ordered. You can do this with a few modifiers such as:
Model.update(
    { "_id": ObjectId("541f7bbb699e6dd5a7caf2d6") },
    { "$push": { "actions": { "$each": [], "$sort": { "date": 1 } } } },
    function(err,numAffected) {
    }
);
This is actually more of a trick you can do with the $sort modifier to simply sort the existing array elements without adding or removing any. In versions prior to 2.6 you also need the $slice "update" modifier here; if you did not actually want to restrict the possible size, it could be set to a value larger than the expected number of array elements, though capping the size is probably a good idea anyway.
Unfortunately, if you were "updating" via a $set statement, then you cannot do this "sorting" in a single update statement, as MongoDB will not allow both types of operations on the array at once. But if you can live with that, then this is a way to keep the array ordered so the first query form works.
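As a hedged illustration of that limitation (the model setup and the action name "review" are hypothetical), the $set and the re-sort have to be issued as two separate updates:

import mongoose from "mongoose";

// Hypothetical model matching the schema above.
const Entity = mongoose.model("Entity", new mongoose.Schema({
    name: String,
    actions: [{ name: String, date: Date }]
}));

async function touchAndResort(id: mongoose.Types.ObjectId): Promise<void> {
    // 1) Change a date via $set; the positional $ targets the matched element.
    await Entity.updateOne(
        { _id: id, "actions.name": "review" },
        { $set: { "actions.$.date": new Date() } }
    );
    // 2) Re-sort the array with the empty-$each $push trick shown earlier.
    await Entity.updateOne(
        { _id: id },
        { $push: { actions: { $each: [], $sort: { date: 1 } } } }
    );
}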
If it just seems too hard to keep an array ordered by date, then you can in fact retrieve the largest value by means of the .aggregate() method. This allows greater manipulation of the documents than is available to basic queries, at a little more cost:
Model.aggregate([
    // Unwind the array to de-normalize as documents
    { "$unwind": "$actions" },
    // Sort the contents per document _id and inner date
    { "$sort": { "_id": 1, "actions.date": 1 } },
    // Group back with the "last" element only
    { "$group": {
        "_id": "$_id",
        "name": { "$last": "$name" },
        "actions": { "$last": "$actions" }
    }}
],
function(err,docs) {
})
And that will "pull apart" the array using the $unwind operator, then process with a next stage to $sort the contents by "date". In the $group pipeline stage the "_id" means to use the original document key to "collect" on, and the $last operator picks the field values from the "last" document ( de-normalized ) on that grouping boundary.
So there are various things that you can do, but of course the best way is to keep your array ordered and use the basic projection operators to simply get the last item in the list.

elasticsearch query dates by range

My elasticsearch has data, particularly something like this for dates:
{
    "startTime": {
        "type": "string",
        "format": "yyyy/MM/dd",
        "index": "analyzed",
        "analyzer": "keyword"
    }
}
I am adding a date range picker and want to use the dates picked to go query elasticsearch for data with startTime inside this range chosen. I'm not sure how to structure this query to elasticsearch, or if it will even work with this being a string field (I can potentially change it, though).
can anyone help me here?
Your field is a string, so the format property is ignored. You should change your mapping and use the date type. Have a look here to see the core types available in elasticsearch.
I would use a filter instead of a query. It will be cached, thus faster. The following is an example with an explicit date range:
{
    "filter" : {
        "range" : {
            "PublishTime" : {
                "from" : "20130505T000000",
                "to" : "20131105T235959"
            }
        }
    }
}
Note that if you use the filter like this it's going to be the same filter the whole day, thus you would make good use of the cache.
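
If you do change the mapping as suggested, a sketch with the modern @elastic/elasticsearch TypeScript client might look like this (index and field names are placeholders, and since a field's type cannot be changed in place, in practice this means creating a new index and reindexing):

import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function createAndSearch(): Promise<void> {
    // New index with startTime mapped as a real date (placeholder names).
    await client.indices.create({
        index: "events-v2",
        mappings: {
            properties: {
                startTime: { type: "date", format: "yyyy/MM/dd" },
            },
        },
    });

    // Range query over the dates picked in the UI.
    const res = await client.search({
        index: "events-v2",
        query: {
            range: { startTime: { gte: "2013/05/05", lte: "2013/11/05" } },
        },
    });
    console.log(res.hits.hits);
}

createAndSearch();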
