Query to get all Cosmos DB documents referenced by another - azure

Assume I have the following Cosmos DB container with the possible doc type partitions:
{
"id": <string>,
"partitionKey": <string>, // Always "item"
"name": <string>
}
{
"id": <string>,
"partitionKey": <string>, // Always "group"
"items": <array[string]> // Always an array of ids for items in the "item" partition
}
I have the id of a "group" document, but I do not have the document itself. What I would like to do is perform a query which gives me all "item" documents referenced by the "group" document.
I know I can perform two queries: 1) Retrieve the "group" document, 2) Perform a query with IN clause on the "item" partition.
As I don't care about the "group" document other than getting the list of ids, is it possible to construct a single query to get me all the "item" documents I want with just the "group" document id?

You'll need to perform two queries, as there are no joins between separate documents. Even though there is support for subqueries, only correlated subqueries are currently supported (meaning, the inner subquery is referencing values from the outer query). Non-correlated subqueries are what you'd need.
Note that, even though you don't want all of the group document, you don't need to retrieve the entire document. You can project just the items property, which can then be used in your 2nd query, with something like array_contains(). Something like:
SELECT VALUE g.items
FROM g
WHERE g.id="1"
AND g.partitionKey="group"
SELECT VALUE i.name
FROM i
WHERE array_contains(<items-from-prior-query>,i.id)
AND i.partitionKey="item"
This documentation page clarifies the two subquery types and support for only correlated subqueries.

Related

Cosmos db null value

I have two kind of record mention below in my table staudentdetail of cosmosDb.In below example previousSchooldetail is nullable filed and it can be present for student or not.
sample record below :-
{
"empid": "1234",
"empname": "ram",
"schoolname": "high school ,bankur",
"class": "10",
"previousSchooldetail": {
"prevSchoolName": "1763440",
"YearLeft": "2001"
} --(Nullable)
}
{
"empid": "12345",
"empname": "shyam",
"schoolname": "high school",
"class": "10"
}
I am trying to access the above record from azure databricks using pyspark or scala code .But when we are building the dataframe reading it from cosmos db it does not bring previousSchooldetail detail in the data frame.But when we change the query including id for which the previousSchooldetail show in the data frame .
Case 1:-
val Query = "SELECT * FROM c "
Result when query fired directly
empid
empname
schoolname
class
Case2:-
val Query = "SELECT * FROM c where c.empid=1234"
Result when query fired with where clause.
empid
empname
school name
class
previousSchooldetail
prevSchoolName
YearLeft
Could you please tell me why i am not able to get previousSchooldetail in case 1 and how should i proceed.
As #Jayendran, mentioned in the comments, the first query will give you the previouschooldetail document wherever they are available. Else, the column would not be present.
You can have this column present for all the scenarios by using the IS_DEFINED function. Try tweaking your query as below:
SELECT c.empid,
c.empname,
IS_DEFINED(c.previousSchooldetail) ? c.previousSchooldetail : null
as previousSchooldetail,
c.schoolname,
c.class
FROM c
If you are looking to get the result as a flat structure, it can be tricky and would need to use two separate queries such as:
Query 1
SELECT c.empid,
c.empname,
c.schoolname,
c.class,
p.prevSchoolName,
p.YearLeft
FROM c JOIN c.previousSchooldetail p
Query 2
SELECT c.empid,
c.empname,
c.schoolname,
c.class,
null as prevSchoolName,
null as YearLeft
FROM c
WHERE not IS_DEFINED (c.previousSchooldetail) or
c.previousSchooldetail = null
Unfortunately, Cosmos DB does not support LEFT JOIN or UNION. Hence, I'm not sure if you can achieve this in a single query.
Alternatively, you can create a stored procedure to return the desired result.

Count and data in single query in Azure Cosmos DB

I want to return the count and data by writing it in a single Cosmos sql query.
Something like
Select *, count() from c
Or if possible i want get the count in a json document.
[
{
"Count" : 1111
},
{
"Name": "Jon",
"Age" : 30
}
]
You're going to have to issue two separate queries - one to get the total number of documents matching your query, and a second to get a page of documents.

Cosmos: DISTINCT results with JOIN and ORDER BY

I'm trying to write a query that uses a JOIN to perform a geo-spatial match against locations in a array. I got it working, but added DISTINCT in order to de-duplicate (Query A):
SELECT DISTINCT VALUE
u
FROM
u
JOIN loc IN u.locations
WHERE
ST_WITHIN(
{'type':'Point','coordinates':[loc.longitude,loc.latitude]},
{'type':'Polygon','coordinates':[[[-108,-43],[-108,-40],[-110,-40],[-110,-43],[-108,-43]]]})
However, I then found that combining DISTINCT with continuation tokens isn't supported unless you also add ORDER BY:
System.ArgumentException: Distict query requires a matching order by in order to return a continuation token. If you would like to serve this query through continuation tokens, then please rewrite the query in the form 'SELECT DISTINCT VALUE c.blah FROM c ORDER BY c.blah' and please make sure that there is a range index on 'c.blah'.
So I tried adding ORDER BY like this (Query B):
SELECT DISTINCT VALUE
u
FROM
u
JOIN loc IN u.locations
WHERE
ST_WITHIN(
{'type':'Point','coordinates':[loc.longitude,loc.latitude]},
{'type':'Polygon','coordinates':[[[-108,-43],[-108,-40],[-110,-40],[-110,-43],[-108,-43]]]})
ORDER BY
u.created
The problem is, the DISTINCT no longer appears to be taking effect because it returns, for example, the same record twice.
To reproduce this, create a single document with this data:
{
"id": "b6dd3e9b-e6c5-4e5a-a257-371e386f1c2e",
"locations": [
{
"latitude": -42,
"longitude": -109
},
{
"latitude": -42,
"longitude": -109
}
],
"created": "2019-03-06T03:43:52.328Z"
}
Then run Query A above. You will get a single result, despite the fact that both locations match the predicate. If you remove the DISTINCT, you'll get the same document twice.
Now run Query B and you'll see it returns the same document twice, despite the DISTINCT clause.
What am I doing wrong here?
Reproduced your issue indeed,based on my researching,it seems a defect in cosmos db distinct query. Please refer to this link:Provide support for DISTINCT.
This feature is broke in the data explorer. Because cosmos can only
return 100 results per page at a time, the distinct keyword will only
apply to a single page. So, if your result set contains more than 100
results, you may still get duplicates back - they will simply be on
separately paged result sets.
You could describe your own situation and vote up this feedback case.

Azure CosmosDB: how to ORDER BY id?

Using a vanilla CosmosDB collection (all default), adding documents like this:
{
"id": "3",
"name": "Hannah"
}
I would like to retrieve records ordered by id, like this:
SELECT c.id FROM c
ORDER BY c.id
This give me the error Order-by item requires a range index to be defined on the corresponding index path.
I expect this is because /id is hash indexed and not range indexed. I've tried to change the Indexing Policy in various ways, but any change I make which would touch / or /id gets wiped when I save.
How can I retrieve documents ordered by ID?
The best way to do this is to store a duplicate property e.g. id2 that has the same value of id, and is indexed using a range index, then use that for sorting, i.e. query for SELECT * FROM c ORDER BY c.id2.
PS: The reason this is not supported is because id is part of a composite index (which is on partition key and row key; id is the row key part) The Cosmos DB team is working on a change that will allow sorting by id.
EDIT: new collections now support ORDER BY c.id as of 7/12/19
I found this page CosmosDB Indexing Policies , which has the below Note that may be helpful:
Azure Cosmos DB returns an error when a query uses ORDER BY but
doesn't have a Range index against the queried path with the maximum
precision.
Some other information from elsewhere in the document:
Range supports efficient equality queries, range queries (using >, <,
>=, <=, !=), and ORDER BY queries. ORDER By queries by default also require maximum index precision (-1). The data type can be String or
Number.
Some guidance on types of queries assisted by Range queries:
Range Range over /prop/? (or /) can be used to serve the following
queries efficiently:
SELECT FROM collection c WHERE c.prop = "value"
SELECT FROM collection c WHERE c.prop > 5
SELECT FROM collection c ORDER BY c.prop
And a code example from the docs also:
var rangeDefault = new DocumentCollection { Id = "rangeCollection" };
// Override the default policy for strings to Range indexing and "max" (-1) precision
rangeDefault.IndexingPolicy = new IndexingPolicy(new RangeIndex(DataType.String) { Precision = -1 });
await client.CreateDocumentCollectionAsync(UriFactory.CreateDatabaseUri("db"), rangeDefault);
Hope this helps,
J

Why does a Azure Cosmos query sorting by timestamp (string) cost so much more than by _ts (built in)?

This query cost 265 RU/s:
SELECT top 1 * FROM c
WHERE c.CollectPackageId = 'd0613cbb-492b-4464-b66b-3634b5571826'
ORDER BY c.StartFetchDateTimeUtc DESC
StartFetchDateTimeUtc is a string property, serialized by using the Cosmos API
This query cost 5 RU/s:
SELECT top 1 * FROM c
WHERE c.CollectPackageId = 'd0613cbb-492b-4464-b66b-3634b5571826'
ORDER BY c._ts DESC
_ts is a built in field, a Unix-based numeric timestamp.
Example result (only including this field and _ts):
"StartFetchDateTimeUtc": "2017-08-08T03:35:04.1654152Z",
"_ts": 1502163306
The index is in place and follows the suggestions & tutorials how to configure a sortable string/timestamp. It looks like:
{
"path": "/StartFetchDateTimeUtc/?",
"indexes": [
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
}
According to this article, the "Item size,Item property count,Data consistency,Indexed properties,Document indexing,Query patterns,Script usage" variables will affect the RU.
So it is very strange that different property costs different RU.
I also create a test demo on my side(with your index and same document property). I have inserted 1000 records to the documentdb. The two different query costs same RU. I suggest you could start a new collection and test again.
The result is like this:
Order by StartFetchDateTimeUtc
Order by _ts

Resources