Date Between Query in Cosmos DB - azure

I am in the building a simple event store in Cosmos DB that has documents that are structured something like this:
{
"id": "e4c2bbd0-2885-4fb5-bcca-90436f79f155",
"entityType": "contact",
"history": [
{
"startDate": 1504656000,
"endDate": 1504656000,
"Name": "John"
},
{
"startDate": 1504828800,
"endDate": 1504828800,
"Name": "Jon"
}
]
}
This might not bet the most efficient way to store it but this is what I am starting with. But I want to be able to query all contact documents out of the db for a certain period of time. The startDate and endDate represent the time the record was valid. The history currently contains the entire history of the record which probably could be improved.
I have tried creating a query like this:
SELECT c.entityType, c.id,history.Name, history.startDate FROM c
JOIN history in c.history
where
c.entityType = "contact" AND
(history.StartDate <= 1504656001
AND history.EndDate >= 1504656001)
This query should return the state of the contact for 9/7/2017 but instead it is returning every one of the history. I have played with several options but I am not sure what I am missing.
I have also tried setting the index (maybe that is the issue?) So I have included the indexing policy here:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "String",
"precision": -1
},
{
"kind": "Range",
"dataType": "Number",
"precision": -1
}
]
}
],
"excludedPaths": []
}
What am I missing? Is the index correct? Is my query correct for a date between query?

You have two issues. One is addressed by Matias in comment.
Second, your condition is history.StartDate <= 1504656001 AND history.EndDate >= 1504656001.
play with the range for e.g. history.StartDate >= 1504656001 AND history.EndDate <= 1504656111.

Related

Can I index an array in a composite index in Azure Cosmos DB?

I have a problem indexing an array in Azure Cosmos DB
I am trying to save this indexing policy via the portal
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
}
],
"compositeIndexes": [
[
{
"path": "/DeviceId",
"order": "ascending"
},
{
"path": "/TimeStamp",
"order": "ascending"
},
{
"path": "/Items/[]/Name/?",
"order": "ascending"
},
{
"path": "/Items/[]/DoubleValue/?",
"order": "ascending"
}
]
]
}
I get the error "Failed to update container DeviceEvents:
Message: {"code":"BadRequest","message":"Message: {"Errors":["The indexing path '\/Items\/[]\/Name\/?' could not be accepted, failed near position '8'."
This seems to be the array [] syntax that is giving an error.
On a side note I am not sure what I am doing makes sense at all but I have a query that looks like this
SELECT SUM(de0["DoubleValue"])
FROM root JOIN de0 IN root["Items"]
WHERE root["ApplicationId"] = 57 AND root["DeviceId"] = 126 AND root["TimeStamp"] >= "2021-02-21T17:55:29.7389397Z" AND de0["Name"] = "Use Case"
Where ApplicationId is the partition key and the item saved looks like this
{
"id": "59ab9323-26ca-436f-8d29-e1ddd826f025",
"DeviceId": 3,
"ApplicationId": 3,
"RawData": "640F7A000A00E30142000000",
"TimeStamp": "2021-02-20T18:36:52.833174Z",
"Items": [
{
"Name": "Battery Status",
"StringValue": "Full",
"DoubleValue": null
},
{
"Name": "Use Case",
"StringValue": null,
"DoubleValue": 12
},
{
"Name": "Battery Voltage",
"StringValue": null,
"DoubleValue": 3.962
},
{
"Name": "Rain Gauge Count",
"StringValue": null,
"DoubleValue": 10
}
],
"_rid": "CgdVAO7B0DNkAAAAAAAAAA==",
"_self": "dbs/CgdVAA==/colls/CgdVAO7B0DM=/docs/CgdVAO7B0DNkAAAAAAAAAA==/",
"_etag": "\"61008771-0000-0d00-0000-603156c50000\"",
"_attachments": "attachments/",
"_ts": 1613846213
}
I need to aggregate on some of these items in the array like say get MAX on temperature or something like this (using Use Case for test although it doesn't make sense). I reasoned that if all the data in the query is in a single composite index the database would be able to do the aggregation without reading the documents themselves. However I can't seem to add a composite index containing an array at all.
Yes, composite index can't contain an array path. It should be a scalar value.
Unlike with included or excluded paths, you can't create a path with
the /* wildcard. Every composite path has an implicit /? at the end of
the path that you don't need to specify. Composite paths lead to a
scalar value and this is the only value that is included in the
composite index.
Reference:https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#composite-indexes

Strange query results in Azure Cosmos DB

I have following documents in my Azure Cosmos DB:
{
"id": "token",
"User": {
"UserToken": "token",
"Email": "test#email.com"
},
"_ts": 1541493290
}
When I run the following query:
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["Email"], "token"))
ORDER BY root["_ts"] DESC
Nothing is returned. But when I change it a bit. For example byconverting Email to email:
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["email"], "token"))
ORDER BY root["_ts"] DESC
The result is found. Moreover when I remove ORDER BY clause, also query returns me a result. So the query is like following
SELECT * FROM root
WHERE ((root["User"]["UserToken"] = "token")
OR CONTAINS(root["User"]["Email"], "token"))
Moreover, when I edit the document (like open it, add an empty line and save), some magic happens in the background and the document is found. For quite "new" documents (less than 1-3 months), I can search them without my "magic" trick.
Indexes definition is:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
}
],
"excludedPaths": []
}
What I did wrong?
UPDATE the answer is not a full explanation but it helps a lot. Full explanation is in my blog (https://stapp.space/ridiculous-bug-in-azure-cosmos-db/)
CONTAINS(root["User"]["Email"], "token") won't work if you have strings indexed as Hash. They need to be Range with -1 precision. Hash only works for equality checks.
That's why the lowercase one is working. Because it cannot find the property and it just ignores it, falling back to the equality check. The first one finds it, sees that it's not indexed as Range and it just fails to return.
Changing indexing to this, will work:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
}
],
"excludedPaths": []
}
On a side note, the _ts field is not the best way to do ordering based on creation. It is a unix timestamp in seconds, so any documents created in the same second won't be properly ordered.

Azure Cosmos DB geospatial lookup consuming too high RU

I have a single Azure Cosmos DB collection I am querying against, hoping to use Geo-spatial index for efficient queries. The problem I'm encountering is that the RU consumption seems inefficient.
The collection has only 50k 1kb documents in it, but a query using ST_DISTANCE returning a single document consumes >900 RUs.
I've seen the RUs scale linearly based on the # of documents in the collection. It would seem indexing should prevent this behavior.
Example Query (950 RUs):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [34.69, -1.91] }) < 500
Example document:
[
{
"id": "1504891036",
"name": "Oujda",
"location": {
"type": "Point",
"coordinates": [
34.69,
-1.91
]
},
"population": 409391,
"country": "Morocco",
"country.iso2": "MA",
"country.iso3": "MAR",
}
]
I've not modified the default indexing policy, which seems to cover spatial indexing:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
},
{
"kind": "Spatial",
"dataType": "Point"
}
]
}
],
"excludedPaths": []
}
I determined the problem. I had transposed the longitude and the latitude coordinate prescribed by GeoJSON:
Cosmos is expecting:
"location": {
"type": "Point",
"coordinates": [
<#lon>,
<#lat>
]
I had assumed, incorrectly, that it was lat/lon. Therefore many of my latitudes were outside of the 90/-90 range required, since longitude can be 180/-180. After re-creating my ~50k documents, RU for coordinate based lookups are consistently <10 RUs.
Before fix (all docs have transposed lat/lon coordinates, many outside the 90/-90 bounds and therefore invalid):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [34.69, -1.91] }) < 500
940 RUs, 1 document returned
After fix (all docs re-created with lat/lon set correctly per GeoJSON specs):
SELECT * FROM c where ST_DISTANCE(c.location, { 'type': 'Point', 'coordinates': [-1.91,34.69] }) < 500
6 RUs, 1 document returned
Initial issue was confirmed/diagnosed by the following query:
SELECT ST_ISVALIDDETAILED(c.location) FROM c where c.name = "Kansas City"
Error: "Latitude values must be between -90 and 90 degrees."

Q: Azure Cosmos DB Graph: How to run queries in Graph API when Indexing Policy is defined as Manual?

In Cosmos DB graph when I am defining Indexing policy as Automatic, I am able to run queries but when I am updating indexing policy to Manual and defining Indexing path (/label/?) and Indexing mode set as 'Consistent', the query is not fetching any data.
Let's say my first query (when Indexing policy set as Manual) is :
g.addV('Azure').property('name','Cerulean Software'))
Result is :
[
{
"id": "0c14a00a-edf6-46b1-9e40-45cc37f750ea",
"label": "Azure",
"type": "vertex",
"properties": {
"name": [
{
"id": "f89ee2ee-74df-4256-a5d4-2b47eb526976",
"value": "Cerulean Software"
}
]
}
}
]
Now, my second query (when Indexing policy set as Manual (see Edit #1 below)) is:
g.V().hasLabel('Azure')
This second query is not fetching any result even though there is vertex present in graph named as 'Azure'.
What could be the possible reason behind this?
Edit #1: Manual Indexing Policy Before Change
"indexingPolicy": {
"automatic": false,
"excludedPaths": [],
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"dataType": "Number",
"kind": "Range",
"precision": -1
},
{
"dataType": "String",
"kind": "Hash",
"precision": 3
}
]
},
{
"path": "/label/?",
"indexes": [
{
"dataType": "String",
"kind": "Hash",
"precision": 3
},
{
"dataType": "Number",
"kind": "Range",
"precision": -1
}
]
}
],
"indexingMode": "consistent"
},
Edit #2: Manual Indexing Policy After Change
"indexingPolicy": {
"automatic": false,
"excludedPaths": [],
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"dataType": "Number",
"kind": "Range",
"precision": -1
},
{
"dataType": "String",
"kind": "Hash",
"precision": 3
}
]
},
{
"path": "/_isEdge/?",
"indexes": [
{
"dataType": "String",
"kind": "Hash",
"precision": 3
},
{
"dataType": "Number",
"kind": "Range",
"precision": -1
}
]
}
],
"indexingMode": "consistent"
},
With Cosmos, graph statements are not executed as traversals on the Azure side. The graph client actually translates gremlin statements into Document SQL calls and then aggregates the results back to you on the client side. In the case of your statement g.V().hasLabel('Azure') the call is actually translated to {"query":"SELECT N_2 FROM Node N_2 WHERE (IS_DEFINED(N_2._isEdge) = false AND (N_2.label = 'Azure'))"}
This can be verified through the use of a proxy such as Fiddler which will allow you to inspect the outbound calls from your machine.
The top level _isEdge property seems to be used across almost all Gremlin translated queries so I suspect that if you add that property to your indexing policy you should start to see the expected results.
EDIT:
I originally missed the part of your indexing policy that sets automatic: false. According to the Cosmos docs (under the heading Opting in and opting out of indexing), By default, all documents are automatically indexed, but you can choose to turn it off. When indexing is turned off, documents can be accessed only through their self-links or by queries using ID.
If you choose to run with indexing turned off, then the rest of your indexing policy is effectively meaningless and queries that aren't directly by document Id will no longer work. Can you elaborate as to what you're actually trying to accomplish here? There seems to be a bit of confusion. The indexing settings you've placed on label and isEdge aren't even necessary because they are the same as the value you've put for * which is the default rule matching all paths.
Post what your goals are for your indexing strategy and I can try to make an appropriate recommendation but you're definitely going to want to put automatic: true back into your policy.

How to search through data with arbitrary amount of fields?

I have the web-form builder for science events. The event moderator creates registration form with arbitrary amount of boolean, integer, enum and text fields.
Created form is used for:
register a new member to event;
search through registered members.
What is the best search tool for second task (to search memebers of event)? Is ElasticSearch well for this task?
I wrote a post about how to index arbitrary data into Elasticsearch and then to search it by specific fields and values. All this, without blowing up your index mapping.
The post is here: http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch/
In short, you will need to do the following steps to get what you want:
Create a special index described in the post.
Flatten the data you want to index using the flattenData function:
https://gist.github.com/smnh/30f96028511e1440b7b02ea559858af4.
Create a document with the original and flattened data and index it into Elasticsearch:
{
"data": { ... },
"flatData": [ ... ]
}
Optional: use Elasticsearch aggregations to find which fields and types have been indexed.
Execute queries on the flatData object to find what you need.
Example
Basing on your original question, let's assume that the first event moderator created a form with following fields to register members for the science event:
name string
age long
sex long - 0 for male, 1 for female
In addition to this data, the related event probably has some sort of id, let's call it eventId. So the final document could look like this:
{
"eventId": "2T73ZT1R463DJNWE36IA8FEN",
"name": "Bob",
"age": 22,
"sex": 0
}
Now, before we index this document, we will flatten it using the flattenData function:
flattenData(document);
This will produce the following array:
[
{
"key": "eventId",
"type": "string",
"key_type": "eventId.string",
"value_string": "2T73ZT1R463DJNWE36IA8FEN"
},
{
"key": "name",
"type": "string",
"key_type": "name.string",
"value_string": "Bob"
},
{
"key": "age",
"type": "long",
"key_type": "age.long",
"value_long": 22
},
{
"key": "sex",
"type": "long",
"key_type": "sex.long",
"value_long": 0
}
]
Then we will wrap this data in a document as I've showed before and index it.
Then, the second event moderator, creates another form having a new field, field with same name and type, and also a field with same name but with different type:
name string
city string
sex string - "male" or "female"
This event moderator decided that instead of having 0 and 1 for male and female, his form will allow choosing between two strings - "male" and "female".
Let's try to flatten the data submitted by this form:
flattenData({
"eventId": "F1BU9GGK5IX3ZWOLGCE3I5ML",
"name": "Alice",
"city": "New York",
"sex": "female"
});
This will produce the following data:
[
{
"key": "eventId",
"type": "string",
"key_type": "eventId.string",
"value_string": "F1BU9GGK5IX3ZWOLGCE3I5ML"
},
{
"key": "name",
"type": "string",
"key_type": "name.string",
"value_string": "Alice"
},
{
"key": "city",
"type": "string",
"key_type": "city.string",
"value_string": "New York"
},
{
"key": "sex",
"type": "string",
"key_type": "sex.string",
"value_string": "female"
}
]
Then, after wrapping the flattened data in a document and indexing it into Elasticsearch we can execute complicated queries.
For example, to find members named "Bob" registered for the event with ID 2T73ZT1R463DJNWE36IA8FEN we can execute the following query:
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "eventId"}},
{"match": {"flatData.value_string.keyword": "2T73ZT1R463DJNWE36IA8FEN"}}
]
}
}
}
},
{
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "name"}},
{"match": {"flatData.value_string": "bob"}}
]
}
}
}
}
]
}
}
}
ElasticSearch automatically detects the field content in order to index it correctly, even if the mapping hasn't been defined previously. So, yes : ElasticSearch suits well these cases.
However, you may want to fine tune this behavior, or maybe the default mapping applied by ElasticSearch doesn't correspond to what you need : in this case, take a look at the default mapping or, for even further control, the dynamic templates feature.
If you let your end users decide the keys you store things in, you'll have an ever-growing mapping and cluster state, which is problematic.
This case and a suggested solution is covered in this article on common problems with Elasticsearch.
Essentially, you want to have everything that can possibly be user-defined as a value. Using nested documents, you can have a key-field and differently mapped value fields to achieve pretty much the same.

Resources