CosmosDB not using indices under certain circumstances - azure

I noticed some odd behaviour in CosmosDB regarding the use of indices.
A few words about my setup:
It is a partitioned CosmosDB collection with 25 partitions.
Two fields, named a and f, hold arrays of strings. They have the following indexing policy:
{
"path": "/a/[]/?",
"indexes": [
{
"kind": "Hash",
"dataType": "String",
"precision": -1
},
{
"kind": "Range",
"dataType": "Number",
"precision": -1
}
]
},
{
"path": "/f/[]/?",
"indexes": [
{
"kind": "Hash",
"dataType": "String",
"precision": -1
},
{
"kind": "Range",
"dataType": "Number",
"precision": -1
}
]
}
A string that appears in field a of one document may also appear in field f of another document.
The odd behaviour occurs when I execute the following query:
SELECT *
FROM Documents d
WHERE ARRAY_CONTAINS(d.a, 'some-string')
If 'some-string' doesn't occur in any other document's f field, all partitions report an IndexHitRatio of 1 (as seen in the QueryMetrics included in the response). This is the behaviour I expect.
But if 'some-string' does occur in another document's f field, the partitions containing such a document report an IndexHitRatio of 0, which has a great impact on the RUs consumed.
Is there a mistake in my setup that leads to this behaviour?
Can anyone else reproduce this behaviour, which would suggest it is a bug?

To get rid of this behaviour I used different precision values for each field. So, now field a has precision -1 and field f has precision 7.
My conclusion from this would be that both fields were written to the same index when they used the same precision. But that would be rather unexpected behaviour for a database?!
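The workaround described above can be sketched as a small Python helper that builds the two included-path entries with distinct Hash precisions. The helper name array_path_entry is hypothetical; only the paths, kinds, data types, and precision values come from the question.

```python
import json

def array_path_entry(field, string_precision):
    """Build an included-path entry for an array-of-strings field (hypothetical helper)."""
    return {
        "path": f"/{field}/[]/?",
        "indexes": [
            {"kind": "Hash", "dataType": "String", "precision": string_precision},
            {"kind": "Range", "dataType": "Number", "precision": -1},
        ],
    }

# Field a keeps precision -1; field f gets precision 7, as in the workaround.
included_paths = [array_path_entry("a", -1), array_path_entry("f", 7)]
print(json.dumps(included_paths, indent=2))
```

This only produces the JSON fragment; applying it to a collection still has to happen through the portal or an SDK.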

Related

Strange query results in Azure Cosmos DB

I have following documents in my Azure Cosmos DB:
{
"id": "token",
"User": {
"UserToken": "token",
"Email": "test#email.com"
},
"_ts": 1541493290
}
When I run the following query:
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["Email"], "token"))
ORDER BY root["_ts"] DESC
Nothing is returned. But when I change it a bit, for example by converting Email to email:
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["email"], "token"))
ORDER BY root["_ts"] DESC
The result is found. Moreover, when I remove the ORDER BY clause, the query also returns a result. So the query looks like the following:
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["Email"], "token"))
Moreover, when I edit the document (like open it, add an empty line and save), some magic happens in the background and the document is found. For quite "new" documents (less than 1-3 months), I can search them without my "magic" trick.
Indexes definition is:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
}
],
"excludedPaths": []
}
What did I do wrong?
UPDATE: The answer is not a full explanation, but it helps a lot. The full explanation is on my blog (https://stapp.space/ridiculous-bug-in-azure-cosmos-db/)
CONTAINS(root["User"]["Email"], "token") won't work if you have strings indexed as Hash. They need to be Range with -1 precision. Hash only works for equality checks.
That's why the lowercase one works: it cannot find the property, so it just ignores it and falls back to the equality check. The first one finds the property, sees that it's not indexed as Range, and fails to return anything.
Changing the indexing policy to this will work:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
}
],
"excludedPaths": []
}
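As a rough sketch, the same change can be applied to an existing policy document in Python before uploading it again. The function name use_range_for_strings is hypothetical, and the sketch only edits a plain dict; it does not talk to Cosmos DB.

```python
import copy

def use_range_for_strings(policy):
    """Return a copy of the policy with every String index set to Range, precision -1."""
    updated = copy.deepcopy(policy)
    for included in updated.get("includedPaths", []):
        for index in included.get("indexes", []):
            if index.get("dataType") == "String":
                index["kind"] = "Range"
                index["precision"] = -1
    return updated

# The policy from the question: strings are Hash indexed with precision 3.
old_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {"kind": "Range", "dataType": "Number", "precision": -1},
                {"kind": "Hash", "dataType": "String", "precision": 3},
            ],
        }
    ],
    "excludedPaths": [],
}

new_policy = use_range_for_strings(old_policy)
print(new_policy["includedPaths"][0]["indexes"])
```

The original dict is left untouched, so the two policies can be compared before anything is changed on the collection.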
On a side note, the _ts field is not the best way to do ordering based on creation. It is a unix timestamp in seconds, so any documents created in the same second won't be properly ordered.

Azure Cosmos DB geospatial lookup consuming too high RU

I have a single Azure Cosmos DB collection I am querying against, hoping to use Geo-spatial index for efficient queries. The problem I'm encountering is that the RU consumption seems inefficient.
The collection has only 50k 1kb documents in it, but a query using ST_DISTANCE returning a single document consumes >900 RUs.
I've seen the RUs scale linearly based on the # of documents in the collection. It would seem indexing should prevent this behavior.
Example Query (950 RUs):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [34.69, -1.91] }) < 500
Example document:
[
{
"id": "1504891036",
"name": "Oujda",
"location": {
"type": "Point",
"coordinates": [
34.69,
-1.91
]
},
"population": 409391,
"country": "Morocco",
"country.iso2": "MA",
"country.iso3": "MAR"
}
]
I've not modified the default indexing policy, which seems to cover spatial indexing:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
},
{
"kind": "Spatial",
"dataType": "Point"
}
]
}
],
"excludedPaths": []
}
I found the problem: I had transposed the longitude and latitude coordinates relative to the order prescribed by GeoJSON.
Cosmos is expecting:
"location": {
"type": "Point",
"coordinates": [
<#lon>,
<#lat>
]
}
I had assumed, incorrectly, that it was lat/lon. Therefore many of my latitudes were outside of the 90/-90 range required, since longitude can be 180/-180. After re-creating my ~50k documents, RU for coordinate based lookups are consistently <10 RUs.
Before fix (all docs have transposed lat/lon coordinates, many outside the 90/-90 bounds and therefore invalid):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [34.69, -1.91] }) < 500
940 RUs, 1 document returned
After fix (all docs re-created with lat/lon set correctly per GeoJSON specs):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [-1.91,34.69] }) < 500
6 RUs, 1 document returned
Initial issue was confirmed/diagnosed by the following query:
SELECT ST_ISVALIDDETAILED(c.location) FROM c where c.name = "Kansas City"
Error: "Latitude values must be between -90 and 90 degrees."
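A check like this can be run before documents are written. The following is a minimal Python sketch, assuming two-element [longitude, latitude] positions; the function name check_geojson_point is hypothetical and the Kansas City coordinates are approximate values used for illustration.

```python
def check_geojson_point(coordinates):
    """Return (is_valid, message) for a GeoJSON Point's [longitude, latitude] pair."""
    if len(coordinates) != 2:
        return False, "Expected exactly two values: [longitude, latitude]."
    lon, lat = coordinates
    if not -180 <= lon <= 180:
        return False, "Longitude values must be between -180 and 180 degrees."
    if not -90 <= lat <= 90:
        return False, "Latitude values must be between -90 and 90 degrees."
    return True, "ok"

# Kansas City is at roughly latitude 39.10, longitude -94.58.
transposed = [39.10, -94.58]   # [lat, lon]: the wrong order
correct = [-94.58, 39.10]      # [lon, lat]: the GeoJSON order

print(check_geojson_point(transposed))
print(check_geojson_point(correct))
```

A range check only catches transposed pairs whose longitude falls outside -90 to 90. The Oujda pair [34.69, -1.91] passes it in either order, so it cannot replace a check against known locations.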

Azure CosmosDb: Order-by item requires a range index

I'm performing a simple query via the Azure Portal "Query Explorer".
Here is my query:
SELECT * FROM c
WHERE c.DataType = 'Fruit'
AND c.ExperimentIdentifier = 'prod'
AND c.Param = 'banana'
AND Contains(c.SampleDateTime, '20171029')
ORDER BY c.SampleDateTime DESC
However, I get the exception:
Order-by item requires a range index to be defined on the corresponding index path.
There is no link to any help for the error, and I cannot make heads or tails of that error message.
What does it mean, why is my query failing and how can I fix it?
P.S. the _ts property is no good to me as I do not want to order by the time the records were inserted.
ORDER BY is served directly from the index and thus it requires the order by item to be Range indexed (as opposed to Hash indexed).
While you could index only the order-by item as Range (for both numbers and strings), my advice is to index all paths as Range with a precision of -1.
Basically, you'd need to update the indexing policy of your collection to be something like this:
{
"automatic": true,
"indexingMode": "consistent",
"includedPaths": [
{
"path": "/*",
"indexes": [
{ "kind": "Range", "dataType": "Number", "precision": -1 },
{ "kind": "Range", "dataType": "String", "precision": -1 }
]
}
]
}
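To see whether an existing policy already satisfies this, a quick local check might look like the sketch below. The function name has_range_index is hypothetical, and the sketch only inspects one literal path entry (the root wildcard "/*" by default); it does not resolve which rule Cosmos DB would actually apply to a given property. The Hash policy is an illustrative example, not the asker's actual policy.

```python
def has_range_index(policy, data_type, path="/*"):
    """Report whether the entry for `path` declares a Range index for `data_type`."""
    for included in policy.get("includedPaths", []):
        if included.get("path") != path:
            continue
        for index in included.get("indexes", []):
            if index.get("kind") == "Range" and index.get("dataType") == data_type:
                return True
    return False

# Illustrative policy where strings are only Hash indexed.
hash_policy = {
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {"kind": "Range", "dataType": "Number", "precision": -1},
                {"kind": "Hash", "dataType": "String", "precision": 3},
            ],
        }
    ]
}

# Policy with Range indexes for both numbers and strings, as recommended above.
range_policy = {
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {"kind": "Range", "dataType": "Number", "precision": -1},
                {"kind": "Range", "dataType": "String", "precision": -1},
            ],
        }
    ]
}

print(has_range_index(hash_policy, "String"))
print(has_range_index(range_policy, "String"))
```

Since SampleDateTime in the question is compared against a string, it is the String entry that matters for that ORDER BY.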

Date Between Query in Cosmos DB

I am building a simple event store in Cosmos DB that has documents that are structured something like this:
{
"id": "e4c2bbd0-2885-4fb5-bcca-90436f79f155",
"entityType": "contact",
"history": [
{
"startDate": 1504656000,
"endDate": 1504656000,
"Name": "John"
},
{
"startDate": 1504828800,
"endDate": 1504828800,
"Name": "Jon"
}
]
}
This might not be the most efficient way to store it, but this is what I am starting with. I want to be able to query all contact documents out of the db for a certain period of time. The startDate and endDate represent the time the record was valid. The history currently contains the entire history of the record, which could probably be improved.
I have tried creating a query like this:
SELECT c.entityType, c.id,history.Name, history.startDate FROM c
JOIN history in c.history
where
c.entityType = "contact" AND
(history.StartDate <= 1504656001
AND history.EndDate >= 1504656001)
This query should return the state of the contact for 9/7/2017, but instead it is returning every history entry. I have played with several options but I am not sure what I am missing.
I have also tried setting the index (maybe that is the issue?), so I have included the indexing policy here:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "String",
"precision": -1
},
{
"kind": "Range",
"dataType": "Number",
"precision": -1
}
]
}
],
"excludedPaths": []
}
What am I missing? Is the index correct? Is my query correct for a date between query?
You have two issues. One is addressed by Matias in the comments.
Second, your condition is history.StartDate <= 1504656001 AND history.EndDate >= 1504656001.
Try a range instead, e.g. history.StartDate >= 1504656001 AND history.EndDate <= 1504656111.
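For reference, here is a minimal Python sketch of the point-in-time filter the question describes, run against the sample document. The function name state_at is hypothetical. Two details are visible in the question itself: the document stores startDate and endDate in camelCase while the query refers to StartDate and EndDate (property names in Cosmos DB SQL are case-sensitive), and each sample entry has identical start and end values.

```python
contact = {
    "id": "e4c2bbd0-2885-4fb5-bcca-90436f79f155",
    "entityType": "contact",
    "history": [
        {"startDate": 1504656000, "endDate": 1504656000, "Name": "John"},
        {"startDate": 1504828800, "endDate": 1504828800, "Name": "Jon"},
    ],
}

def state_at(document, timestamp):
    """Return the history entries whose [startDate, endDate] interval contains timestamp."""
    return [
        entry
        for entry in document["history"]
        if entry["startDate"] <= timestamp <= entry["endDate"]
    ]

print(state_at(contact, 1504656000))  # the "John" entry
print(state_at(contact, 1504656001))  # empty list: no interval covers this second
```

Because startDate equals endDate in the sample data, each entry is valid for a single second only, so a timestamp such as 1504656001 matches nothing even with correctly cased property names.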

Azure Cosmos DB Graph: How to run queries in Graph API when Indexing Policy is defined as Manual?

In a Cosmos DB graph, when I set the indexing policy to Automatic I am able to run queries, but when I change the indexing policy to Manual, define an indexing path (/label/?), and set the indexing mode to 'Consistent', the query does not fetch any data.
Let's say my first query (with the indexing policy set to Manual) is:
g.addV('Azure').property('name','Cerulean Software')
The result is:
[
{
"id": "0c14a00a-edf6-46b1-9e40-45cc37f750ea",
"label": "Azure",
"type": "vertex",
"properties": {
"name": [
{
"id": "f89ee2ee-74df-4256-a5d4-2b47eb526976",
"value": "Cerulean Software"
}
]
}
}
]
Now, my second query (with the indexing policy set to Manual; see Edit #1 below) is:
g.V().hasLabel('Azure')
This second query does not return any result, even though there is a vertex with the label 'Azure' in the graph.
What could be the reason behind this?
Edit #1: Manual Indexing Policy Before Change
"indexingPolicy": {
"automatic": false,
"excludedPaths": [],
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"dataType": "Number",
"kind": "Range",
"precision": -1
},
{
"dataType": "String",
"kind": "Hash",
"precision": 3
}
]
},
{
"path": "/label/?",
"indexes": [
{
"dataType": "String",
"kind": "Hash",
"precision": 3
},
{
"dataType": "Number",
"kind": "Range",
"precision": -1
}
]
}
],
"indexingMode": "consistent"
},
Edit #2: Manual Indexing Policy After Change
"indexingPolicy": {
"automatic": false,
"excludedPaths": [],
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"dataType": "Number",
"kind": "Range",
"precision": -1
},
{
"dataType": "String",
"kind": "Hash",
"precision": 3
}
]
},
{
"path": "/_isEdge/?",
"indexes": [
{
"dataType": "String",
"kind": "Hash",
"precision": 3
},
{
"dataType": "Number",
"kind": "Range",
"precision": -1
}
]
}
],
"indexingMode": "consistent"
},
With Cosmos, graph statements are not executed as traversals on the Azure side. The graph client actually translates Gremlin statements into Document SQL calls and then aggregates the results back to you on the client side. In the case of your statement g.V().hasLabel('Azure'), the call is actually translated to:
{"query":"SELECT N_2 FROM Node N_2 WHERE (IS_DEFINED(N_2._isEdge) = false AND (N_2.label = 'Azure'))"}
This can be verified through the use of a proxy such as Fiddler which will allow you to inspect the outbound calls from your machine.
The top level _isEdge property seems to be used across almost all Gremlin translated queries so I suspect that if you add that property to your indexing policy you should start to see the expected results.
EDIT:
I originally missed the part of your indexing policy that sets automatic: false. According to the Cosmos docs (under the heading Opting in and opting out of indexing): "By default, all documents are automatically indexed, but you can choose to turn it off. When indexing is turned off, documents can be accessed only through their self-links or by queries using ID."
If you choose to run with indexing turned off, then the rest of your indexing policy is effectively meaningless, and queries that aren't directly by document ID will no longer work. Can you elaborate on what you're actually trying to accomplish here? There seems to be a bit of confusion. The indexing settings you've placed on label and _isEdge aren't even necessary, because they are the same as the value you've put for *, which is the default rule matching all paths.
Post what your goals are for your indexing strategy and I can try to make an appropriate recommendation but you're definitely going to want to put automatic: true back into your policy.
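The point about redundant path entries can be checked mechanically. Below is a small Python sketch, with the hypothetical function name redundant_paths, that lists included paths whose index settings are identical to those of the default "/*" rule. It compares indexes as unordered sets, since the entries in the question list them in a different order.

```python
def redundant_paths(policy, default_path="/*"):
    """List included paths whose indexes match the default rule's indexes exactly."""
    def as_set(indexes):
        return {(i["kind"], i["dataType"], i.get("precision")) for i in indexes}

    included = policy.get("includedPaths", [])
    default = next((p for p in included if p["path"] == default_path), None)
    if default is None:
        return []
    default_indexes = as_set(default["indexes"])
    return [
        p["path"]
        for p in included
        if p["path"] != default_path and as_set(p["indexes"]) == default_indexes
    ]

# The policy from Edit #2 in the question.
policy = {
    "automatic": False,
    "excludedPaths": [],
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {"dataType": "Number", "kind": "Range", "precision": -1},
                {"dataType": "String", "kind": "Hash", "precision": 3},
            ],
        },
        {
            "path": "/_isEdge/?",
            "indexes": [
                {"dataType": "String", "kind": "Hash", "precision": 3},
                {"dataType": "Number", "kind": "Range", "precision": -1},
            ],
        },
    ],
    "indexingMode": "consistent",
}

print(redundant_paths(policy))
```

The sketch says nothing about the automatic flag, which is the setting the answer identifies as the real cause.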

Resources