I am querying a single Azure Cosmos DB collection and hoping to use the geospatial index for efficient queries. The problem I'm encountering is that the RU consumption seems far too high.
The collection has only 50k 1 KB documents in it, but a query using ST_DISTANCE that returns a single document consumes >900 RUs.
I've seen the RUs scale linearly with the number of documents in the collection. It would seem that indexing should prevent this behavior.
Example Query (950 RUs):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [34.69, -1.91] }) < 500
Example document:
[
{
"id": "1504891036",
"name": "Oujda",
"location": {
"type": "Point",
"coordinates": [
34.69,
-1.91
]
},
"population": 409391,
"country": "Morocco",
"country.iso2": "MA",
"country.iso3": "MAR",
}
]
I've not modified the default indexing policy, which seems to cover spatial indexing:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
},
{
"kind": "Spatial",
"dataType": "Point"
}
]
}
],
"excludedPaths": []
}
I determined the problem. I had transposed the longitude and the latitude coordinate prescribed by GeoJSON:
Cosmos is expecting:
"location": {
"type": "Point",
"coordinates": [
<#lon>,
<#lat>
]
}
I had assumed, incorrectly, that the order was lat/lon. As a result, many of the values in the latitude position were outside the required -90 to 90 range, since longitude can range from -180 to 180. After re-creating my ~50k documents, coordinate-based lookups consistently cost <10 RUs.
Before fix (all docs have transposed lat/lon coordinates, many outside the 90/-90 bounds and therefore invalid):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [34.69, -1.91] }) < 500
940 RUs, 1 document returned
After fix (all docs re-created with coordinates in [lon, lat] order per the GeoJSON spec):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [-1.91,34.69] }) < 500
6 RUs, 1 document returned
Initial issue was confirmed/diagnosed by the following query:
SELECT ST_ISVALIDDETAILED(c.location) FROM c where c.name = "Kansas City"
Error: "Latitude values must be between -90 and 90 degrees."
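A similar range check can be run client-side before documents are written. The following is a minimal Python sketch, not the Cosmos DB implementation: the helper name check_point and its messages are hypothetical, the Kansas City coordinates are approximate illustrative values, and only the longitude/latitude range rule for a GeoJSON Point is covered.

```python
def check_point(point):
    """Return (is_valid, reason) for a GeoJSON Point dict.

    Hypothetical helper: mirrors only the longitude/latitude range rule,
    not everything ST_ISVALIDDETAILED may check.
    """
    if point.get("type") != "Point":
        return False, "type must be 'Point'"
    coords = point.get("coordinates")
    if not isinstance(coords, (list, tuple)) or len(coords) < 2:
        return False, "coordinates must be [longitude, latitude]"
    lon, lat = coords[0], coords[1]
    if not -180 <= lon <= 180:
        return False, "longitude must be between -180 and 180 degrees"
    if not -90 <= lat <= 90:
        return False, "latitude must be between -90 and 90 degrees"
    return True, ""


# Kansas City written as [lat, lon] by mistake: -94.58 lands in the latitude slot.
transposed = {"type": "Point", "coordinates": [39.10, -94.58]}
# The same city in GeoJSON order, [lon, lat].
corrected = {"type": "Point", "coordinates": [-94.58, 39.10]}

print(check_point(transposed))
print(check_point(corrected))
```

Note that a range check only catches some transposed points. The Oujda sample, [34.69, -1.91], passes it in either order, so it cannot replace knowing which order the source data uses.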
I have a problem indexing an array in Azure Cosmos DB
I am trying to save this indexing policy via the portal
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
}
],
"compositeIndexes": [
[
{
"path": "/DeviceId",
"order": "ascending"
},
{
"path": "/TimeStamp",
"order": "ascending"
},
{
"path": "/Items/[]/Name/?",
"order": "ascending"
},
{
"path": "/Items/[]/DoubleValue/?",
"order": "ascending"
}
]
]
}
I get the error "Failed to update container DeviceEvents:
Message: {"code":"BadRequest","message":"Message: {"Errors":["The indexing path '\/Items\/[]\/Name\/?' could not be accepted, failed near position '8'."
This seems to be the array [] syntax that is giving an error.
On a side note, I am not sure that what I am doing makes sense at all, but I have a query that looks like this:
SELECT SUM(de0["DoubleValue"])
FROM root JOIN de0 IN root["Items"]
WHERE root["ApplicationId"] = 57 AND root["DeviceId"] = 126 AND root["TimeStamp"] >= "2021-02-21T17:55:29.7389397Z" AND de0["Name"] = "Use Case"
Where ApplicationId is the partition key and the item saved looks like this
{
"id": "59ab9323-26ca-436f-8d29-e1ddd826f025",
"DeviceId": 3,
"ApplicationId": 3,
"RawData": "640F7A000A00E30142000000",
"TimeStamp": "2021-02-20T18:36:52.833174Z",
"Items": [
{
"Name": "Battery Status",
"StringValue": "Full",
"DoubleValue": null
},
{
"Name": "Use Case",
"StringValue": null,
"DoubleValue": 12
},
{
"Name": "Battery Voltage",
"StringValue": null,
"DoubleValue": 3.962
},
{
"Name": "Rain Gauge Count",
"StringValue": null,
"DoubleValue": 10
}
],
"_rid": "CgdVAO7B0DNkAAAAAAAAAA==",
"_self": "dbs/CgdVAA==/colls/CgdVAO7B0DM=/docs/CgdVAO7B0DNkAAAAAAAAAA==/",
"_etag": "\"61008771-0000-0d00-0000-603156c50000\"",
"_attachments": "attachments/",
"_ts": 1613846213
}
I need to aggregate over some of the items in the array, for example to get the MAX of a temperature reading (I'm using Use Case for the test even though it doesn't make sense). I reasoned that if all the data in the query were in a single composite index, the database would be able to do the aggregation without reading the documents themselves. However, I can't seem to add a composite index containing an array path at all.
Yes, a composite index can't contain an array path. Each path must lead to a scalar value.
Unlike with included or excluded paths, you can't create a path with
the /* wildcard. Every composite path has an implicit /? at the end of
the path that you don't need to specify. Composite paths lead to a
scalar value and this is the only value that is included in the
composite index.
Reference: https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#composite-indexes
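As a rough illustration of the rule quoted above, the following Python sketch flags composite index paths that would be rejected. The helper name invalid_composite_paths is hypothetical and the check is a simplification (it only looks for the [] array marker and the * wildcard); the service's own validation remains authoritative.

```python
def invalid_composite_paths(policy):
    """Return composite index paths containing an array marker or wildcard.

    Hypothetical helper; a simplified reading of the documented rule that
    composite paths must lead to a scalar value.
    """
    bad = []
    for composite in policy.get("compositeIndexes", []):
        for entry in composite:
            segments = entry["path"].split("/")
            if "[]" in segments or "*" in segments:
                bad.append(entry["path"])
    return bad


# The composite index from the question, as a plain dict.
policy = {
    "compositeIndexes": [
        [
            {"path": "/DeviceId", "order": "ascending"},
            {"path": "/TimeStamp", "order": "ascending"},
            {"path": "/Items/[]/Name/?", "order": "ascending"},
            {"path": "/Items/[]/DoubleValue/?", "order": "ascending"},
        ]
    ]
}

print(invalid_composite_paths(policy))
```

Removing the flagged entries leaves a composite index on /DeviceId and /TimeStamp only.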
I have documents like the following in my Azure Cosmos DB:
{
"id": "token",
"User": {
"UserToken": "token",
"Email": "test#email.com"
},
"_ts": 1541493290
}
When I run the following query:
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["Email"], "token"))
ORDER BY root["_ts"] DESC
Nothing is returned. But when I change it a bit, for example by changing Email to email:
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["email"], "token"))
ORDER BY root["_ts"] DESC
The result is found. Moreover, when I remove the ORDER BY clause, the query also returns a result. So the query looks like the following:
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["Email"], "token"))
Moreover, when I edit the document (like open it, add an empty line and save), some magic happens in the background and the document is found. For quite "new" documents (less than 1-3 months), I can search them without my "magic" trick.
Indexes definition is:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
}
],
"excludedPaths": []
}
What did I do wrong?
UPDATE: the answer is not a full explanation, but it helps a lot. The full explanation is on my blog (https://stapp.space/ridiculous-bug-in-azure-cosmos-db/).
CONTAINS(root["User"]["Email"], "token") won't work if you have strings indexed as Hash. They need to be Range with -1 precision. Hash only works for equality checks.
That's why the lowercase one works: it cannot find the property, so it just ignores it and falls back to the equality check. The first one finds the property, sees that it's not indexed as Range, and fails to return anything.
Changing the indexing policy to this will work:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
}
],
"excludedPaths": []
}
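If several containers need the same Hash-to-Range change, the edit can be scripted on the policy as a plain dict. This is a minimal Python sketch under that assumption; the helper name hash_strings_to_range is hypothetical, and reading and writing the policy through the portal or an SDK is not shown.

```python
import copy


def hash_strings_to_range(policy):
    """Return a copy of the policy with String Hash indexes turned into Range.

    Hypothetical helper operating on the indexing policy as a plain dict.
    """
    updated = copy.deepcopy(policy)
    for included in updated.get("includedPaths", []):
        for index in included.get("indexes", []):
            if index.get("kind") == "Hash" and index.get("dataType") == "String":
                index["kind"] = "Range"
                index["precision"] = -1
    return updated


# The original policy from the question.
old_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {"kind": "Range", "dataType": "Number", "precision": -1},
                {"kind": "Hash", "dataType": "String", "precision": 3},
            ],
        }
    ],
    "excludedPaths": [],
}

new_policy = hash_strings_to_range(old_policy)
print(new_policy["includedPaths"][0]["indexes"][1])
```

The input dict is left untouched, so the old and new policies can be compared before anything is saved.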
On a side note, the _ts field is not the best way to do ordering based on creation. It is a unix timestamp in seconds, so any documents created in the same second won't be properly ordered.
I am building a simple event store in Cosmos DB with documents that are structured something like this:
{
"id": "e4c2bbd0-2885-4fb5-bcca-90436f79f155",
"entityType": "contact",
"history": [
{
"startDate": 1504656000,
"endDate": 1504656000,
"Name": "John"
},
{
"startDate": 1504828800,
"endDate": 1504828800,
"Name": "Jon"
}
]
}
This might not be the most efficient way to store it, but this is what I am starting with. I want to be able to query all contact documents out of the db for a certain period of time. The startDate and endDate represent the time the record was valid. The history currently contains the entire history of the record, which could probably be improved.
I have tried creating a query like this:
SELECT c.entityType, c.id,history.Name, history.startDate FROM c
JOIN history in c.history
where
c.entityType = "contact" AND
(history.StartDate <= 1504656001
AND history.EndDate >= 1504656001)
This query should return the state of the contact for 9/7/2017, but instead it returns every entry in the history. I have played with several options, but I am not sure what I am missing.
I have also tried setting the index (maybe that is the issue?), so I have included the indexing policy here:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "String",
"precision": -1
},
{
"kind": "Range",
"dataType": "Number",
"precision": -1
}
]
}
],
"excludedPaths": []
}
What am I missing? Is the index correct? Is my query correct for a date between query?
You have two issues. One is addressed by Matias in a comment.
Second, your condition is history.StartDate <= 1504656001 AND history.EndDate >= 1504656001.
Play with the range instead, e.g. history.StartDate >= 1504656001 AND history.EndDate <= 1504656111.
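To see what a point-in-time filter does with the sample data, here is a minimal Python sketch that applies the same comparison to the history array in memory. The helper name entries_valid_at is hypothetical. Two things stand out in the sample: the stored properties are startDate and endDate in lower camel case while the query uses StartDate and EndDate (property names in Cosmos DB queries are case-sensitive, as far as I know), and each entry has startDate equal to endDate, so its validity interval is a single instant.

```python
def entries_valid_at(document, instant):
    """Return history entries whose [startDate, endDate] interval contains instant.

    Keys are looked up with the exact casing used in the stored document;
    an entry missing either key is treated as no match.
    """
    matches = []
    for entry in document.get("history", []):
        start = entry.get("startDate")
        end = entry.get("endDate")
        if start is None or end is None:
            continue
        if start <= instant <= end:
            matches.append(entry)
    return matches


# The sample document from the question.
contact = {
    "id": "e4c2bbd0-2885-4fb5-bcca-90436f79f155",
    "entityType": "contact",
    "history": [
        {"startDate": 1504656000, "endDate": 1504656000, "Name": "John"},
        {"startDate": 1504828800, "endDate": 1504828800, "Name": "Jon"},
    ],
}

print(entries_valid_at(contact, 1504656001))  # one second after the first entry
print(entries_valid_at(contact, 1504656000))  # exactly on the first entry
```

With this data the filter matches only when the instant equals a stored timestamp exactly, which suggests that endDate would need to extend to the start of the next entry for point-in-time lookups to work.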
In Cosmos DB graph, when I define the indexing policy as Automatic, I am able to run queries. But when I change the indexing policy to Manual, define the indexing path (/label/?), and set the indexing mode to 'Consistent', the query does not fetch any data.
Let's say my first query (with the indexing policy set to Manual) is:
g.addV('Azure').property('name','Cerulean Software')
Result is :
[
{
"id": "0c14a00a-edf6-46b1-9e40-45cc37f750ea",
"label": "Azure",
"type": "vertex",
"properties": {
"name": [
{
"id": "f89ee2ee-74df-4256-a5d4-2b47eb526976",
"value": "Cerulean Software"
}
]
}
}
]
Now, my second query (when Indexing policy set as Manual (see Edit #1 below)) is:
g.V().hasLabel('Azure')
This second query does not fetch any result even though a vertex labeled 'Azure' is present in the graph.
What could be the possible reason behind this?
Edit #1: Manual Indexing Policy Before Change
"indexingPolicy": {
"automatic": false,
"excludedPaths": [],
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"dataType": "Number",
"kind": "Range",
"precision": -1
},
{
"dataType": "String",
"kind": "Hash",
"precision": 3
}
]
},
{
"path": "/label/?",
"indexes": [
{
"dataType": "String",
"kind": "Hash",
"precision": 3
},
{
"dataType": "Number",
"kind": "Range",
"precision": -1
}
]
}
],
"indexingMode": "consistent"
},
Edit #2: Manual Indexing Policy After Change
"indexingPolicy": {
"automatic": false,
"excludedPaths": [],
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"dataType": "Number",
"kind": "Range",
"precision": -1
},
{
"dataType": "String",
"kind": "Hash",
"precision": 3
}
]
},
{
"path": "/_isEdge/?",
"indexes": [
{
"dataType": "String",
"kind": "Hash",
"precision": 3
},
{
"dataType": "Number",
"kind": "Range",
"precision": -1
}
]
}
],
"indexingMode": "consistent"
},
With Cosmos, graph statements are not executed as traversals on the Azure side. The graph client actually translates Gremlin statements into Document SQL calls and then aggregates the results back to you on the client side. In the case of your statement g.V().hasLabel('Azure'), the call is actually translated to {"query":"SELECT N_2 FROM Node N_2 WHERE (IS_DEFINED(N_2._isEdge) = false AND (N_2.label = 'Azure'))"}
This can be verified through the use of a proxy such as Fiddler which will allow you to inspect the outbound calls from your machine.
The top level _isEdge property seems to be used across almost all Gremlin translated queries so I suspect that if you add that property to your indexing policy you should start to see the expected results.
EDIT:
I originally missed the part of your indexing policy that sets automatic: false. According to the Cosmos docs (under the heading "Opting in and opting out of indexing"): "By default, all documents are automatically indexed, but you can choose to turn it off. When indexing is turned off, documents can be accessed only through their self-links or by queries using ID."
If you choose to run with indexing turned off, then the rest of your indexing policy is effectively meaningless and queries that aren't directly by document ID will no longer work. Can you elaborate as to what you're actually trying to accomplish here? There seems to be a bit of confusion. The indexing settings you've placed on label and _isEdge aren't even necessary because they are the same as the value you've put for /*, which is the default rule matching all paths.
Post what your goals are for your indexing strategy and I can try to make an appropriate recommendation but you're definitely going to want to put automatic: true back into your policy.
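The two observations above (automatic is off, and the extra included path repeats the /* rule) can be checked mechanically. This is a minimal Python sketch over the policy as a plain dict; the helper name audit_policy and the wording of its findings are hypothetical.

```python
def audit_policy(policy):
    """Return a list of findings for an indexing policy dict (hypothetical helper)."""
    findings = []
    if policy.get("automatic") is False:
        findings.append("automatic is false")

    def normalize(indexes):
        # Compare index definitions regardless of the order they are listed in.
        return sorted(
            (i.get("dataType"), i.get("kind"), i.get("precision")) for i in indexes
        )

    by_path = {
        p["path"]: p.get("indexes", []) for p in policy.get("includedPaths", [])
    }
    default = by_path.get("/*")
    if default is not None:
        for path, indexes in by_path.items():
            if path != "/*" and normalize(indexes) == normalize(default):
                findings.append(f"{path} repeats the /* indexes")
    return findings


# The policy from Edit #2 in the question.
policy = {
    "automatic": False,
    "excludedPaths": [],
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {"dataType": "Number", "kind": "Range", "precision": -1},
                {"dataType": "String", "kind": "Hash", "precision": 3},
            ],
        },
        {
            "path": "/_isEdge/?",
            "indexes": [
                {"dataType": "String", "kind": "Hash", "precision": 3},
                {"dataType": "Number", "kind": "Range", "precision": -1},
            ],
        },
    ],
    "indexingMode": "consistent",
}

for finding in audit_policy(policy):
    print(finding)
```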
My data looks something like this:
{
"id": "a06b42cf-d130-459a-8c89-dab77966747c",
"propertyBag": {
"Fixed": {
"address": {
"locationName": "",
"addressLine1": "1 Microsoft Way",
"addressLine2": null,
"city": "Redmond",
"postalCode": "98052",
"subDivision": null,
"state": "WA",
"country": "USA",
"location": {
"type": "Point",
"coordinates": [
47.640049,
-122.129797
]
}
}
}
}
}
Now when I try to query something like this
SELECT * FROM V v
WHERE ST_DISTANCE(v.propertyBag.Fixed.address.location, {
"type": "Point",
"coordinates": [47.36, -122.19]
}) < 100 * 1000
The results are always empty. Can someone please let me know what may be wrong?
I suspect that you just have the longitude and latitude transposed, because if I change the document to:
"location": {
"type": "Point",
"coordinates": [-122.129797, 47.640049]
}
And I run this query:
SELECT
ST_DISTANCE(v.propertyBag.Fixed.address.location, {
"type": "Point",
"coordinates": [-122.19, 47.36]
})
FROM v
I get a result, but if I run it the way you show, I get no results.
In GeoJSON, points are specified with [longitude, latitude] to make it match with our normal expectations of x being east-west, and y being north-south. Unfortunately, this is the opposite of the traditional way of showing GEO coordinates.
-122 is not a valid value for latitude. The range for latitude is -90 to +90. Longitude is specified -180 to +180.
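As a sanity check on the corrected coordinates, the distance can be estimated outside the database. The sketch below uses the haversine formula on a sphere with a mean Earth radius, which is an assumption on my part: ST_DISTANCE may use a different Earth model, so the two results can differ slightly. The function name haversine_m is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


def haversine_m(point_a, point_b):
    """Approximate distance in meters between two GeoJSON Points ([lon, lat])."""
    lon1, lat1 = point_a["coordinates"][:2]
    lon2, lat2 = point_b["coordinates"][:2]
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Both points in GeoJSON order: [longitude, latitude].
stored = {"type": "Point", "coordinates": [-122.129797, 47.640049]}
target = {"type": "Point", "coordinates": [-122.19, 47.36]}

distance = haversine_m(stored, target)
print(round(distance / 1000, 1), "km")
```

The result comes out at a few tens of kilometers, well inside the 100 * 1000 meter threshold used in the question's query.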
If your database is already populated and you don't feel like migrating it, then you could use a user-defined function (UDF) to fix it during the query, but I would strongly recommend doing the migration over this approach because geospatial indexes won't work as you have it now and your queries will be much slower as a result.
Again, I don't recommend this unless a GEO index is not important, but here is a swapXY UDF that will do the swap:
function(point) {
return {
type: "Point",
coordinates: [point.coordinates[1], point.coordinates[0]]
};
}
You use it in a query like this:
SELECT * FROM v
WHERE
ST_DISTANCE(
udf.swapXY(v.propertyBag.Fixed.address.location),
udf.swapXY({
"type": "Point",
"coordinates": [47.36, -122.19]
})
) < 100 * 1000
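For the migration route recommended above, the per-document fix is the same swap applied once before the document is written back. This is a minimal Python sketch of that step only; the helper name swap_point_at is hypothetical, and reading from and writing to the container is not shown.

```python
import copy


def swap_point_at(document, path):
    """Return a copy of document with the Point at `path` swapped to the other axis order.

    Hypothetical migration helper. Running it twice restores the original
    order, so it should be applied exactly once per document.
    """
    fixed = copy.deepcopy(document)
    node = fixed
    for key in path:
        node = node[key]
    first, second = node["coordinates"][:2]
    node["coordinates"] = [second, first]
    return fixed


# Trimmed version of the document from the question, stored as [lat, lon].
doc = {
    "id": "a06b42cf-d130-459a-8c89-dab77966747c",
    "propertyBag": {
        "Fixed": {
            "address": {
                "city": "Redmond",
                "location": {
                    "type": "Point",
                    "coordinates": [47.640049, -122.129797],
                },
            }
        }
    },
}

migrated = swap_point_at(doc, ["propertyBag", "Fixed", "address", "location"])
print(migrated["propertyBag"]["Fixed"]["address"]["location"]["coordinates"])
```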