Azure RUs on Cosmos DB

I am trying to find out how RUs work in order to optimize the requests made to the DB.
I have a simple query where I select by id:
SELECT * FROM c WHERE c.id='cl0'
That query costs 277.08 RUs.
Then I have another query where I select by another property:
SELECT * FROM c WHERE c.name[0].id='35bfea78-ccda-4cc5-9539-bd7ff1dd474b'
That query costs 2.95 RUs.
I can't figure out why there is such a big difference in consumed RUs between these two queries.
The two queries return the exact same result:
[
    {
        "label": "class",
        "id": "cl0",
        "_id": "cl0",
        "name": [
            {
                "_value": "C0.Iklos0",
                "id": "35bfea78-ccda-4cc5-9539-bd7ff1dd474b"
            }
        ],
        "_rid": "6Ds6AJHyfgBfAAAAADFT==",
        "_self": "dbs/6Ds4FA==/colls/6Ds6DFewfgA=/docs/6Ds6AJHyfgBdESFAAAAAAA==/",
        "_etag": "\"00007200-0000-0000-0000-w3we73140000\"",
        "_attachments": "attachments/",
        "_ts": 1528722196
    }
]

I faced a similar issue previously, so you are not the only one. I can offer two suggestions.
1. The query SELECT * FROM c WHERE c.id='cl0' has to fan out across the whole collection. If you make an appropriate property the partition key and include it in your queries, it will greatly improve performance (see the sketch after this list).
You could refer to this doc to learn how to choose a partition key.
2. I found the answer below in the thread "Azure DocumentDB Query by Id is very slow":
Microsoft support responded and they've resolved the issue. They've added IndexVersion 2 for the collection. Unfortunately, it is not yet available from the portal, and newly created accounts/collections are still not using the new version. You'll have to contact Microsoft Support to make changes to your account.
I suggest submitting feedback here to track this announcement.
Hope this helps.
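For point 1, here is a minimal sketch of a point read using the current @azure/cosmos JavaScript SDK (the endpoint, key, ids, and partition key value are placeholders, not taken from the question). Once a document is addressed by both id and partition key, the request avoids the cross-partition fan-out that makes the id-only query expensive:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" });
const container = client.database("<database-id>").container("<container-id>");

async function readByIdAndPartitionKey() {
  // Point read: id plus partition key value routes straight to one partition.
  const { resource, requestCharge } = await container
    .item("cl0", "<partition-key-value>")
    .read();
  console.log("RUs:", requestCharge, "doc:", resource);
}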
-- Edit
To upgrade to index version 2, use the following code (classic .NET DocumentDB SDK):
// Read the current collection definition by its self link.
var collection = (await client.ReadDocumentCollectionAsync(string.Format("/dbs/{0}/colls/{1}", databaseId, collectionId))).Resource;
// Set the new index version on the collection resource.
collection.SetPropertyValue("IndexVersion", 2);
// Replace the collection definition so the change takes effect.
var replacedCollection = await client.ReplaceDocumentCollectionAsync(collection);

RU consumption depends on your document size and your query. I highly recommend the link below on query metrics. If you want to tune your query or understand latency, check the query feed details, for example:
x-ms-documentdb-query-metrics:
totalExecutionTimeInMs=33.67; queryCompileTimeInMs=0.06; queryLogicalPlanBuildTimeInMs=0.02; queryPhysicalPlanBuildTimeInMs=0.10; queryOptimizationTimeInMs=0.00; VMExecutionTimeInMs=32.56; indexLookupTimeInMs=0.36; documentLoadTimeInMs=9.58; systemFunctionExecuteTimeInMs=0.00; userFunctionExecuteTimeInMs=0.00; retrievedDocumentCount=2000; retrievedDocumentSize=1125600; outputDocumentCount=2000; writeOutputTimeInMs=18.10; indexUtilizationRatio=1.00
x-ms-request-charge: 604.42
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-metrics
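A hedged sketch of requesting those metrics programmatically with the current @azure/cosmos JavaScript SDK (the option and property names are my assumption; verify them against your SDK version):

// Ask the service to return per-query metrics along with the results.
const iterator = container.items.query(
  "SELECT * FROM c WHERE c.name[0].id = '35bfea78-ccda-4cc5-9539-bd7ff1dd474b'",
  { populateQueryMetrics: true }
);
const page = await iterator.fetchNext();
console.log("request charge (RUs):", page.requestCharge);
console.log("query metrics:", page.queryMetrics);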

Related

Cannot identify the correct COSMOS DB SQL SELECT syntax to check if coordinates (Point) are within a Polygon

I'm developing an app that uses Cosmos DB SQL. The intention is to identify whether a potential construction site is within various restricted zones, such as national parks and sites of special scientific interest. This information is very useful for obtaining all the appropriate planning permissions.
I have created a container named 'geodata' containing 15 documents that I imported using data from a GeoJSON file provided by the UK National Parks. I have confirmed that all the polygons are valid using an ST_ISVALIDDETAILED SQL statement. I have also checked that the coordinates are anti-clockwise. A few documents contain MultiPolygons. The Geospatial Configuration of the container is 'Geography'.
I am using the Azure Cosmos Data Explorer to identify the correct format of a SELECT statement to identify if given coordinates (Point) are within any of the polygons within the 15 documents.
SELECT c.properties.npark18nm
FROM c
WHERE ST_WITHIN({"type": "Point", "coordinates":[-3.139638969259495,54.595188276959284]}, c.geometry)
The embedded coordinates are within a National Park, in this case, the Lake District in the UK (it also happens to be my favourite coffee haunt).
'c.geometry' is the JSON field within the documents.
"type": "Feature",
"properties": {
"objectid": 3,
"npark18cd": "E26000004",
"npark18nm": "Northumberland National Park",
"npark18nmw": " ",
"bng_e": 385044,
"bng_n": 600169,
"long": -2.2370801,
"lat": 55.29539871,
"st_areashape": 1050982397.6985701,
"st_lengthshape": 339810.592994494
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-2.182235310191206,
55.586659699934806
],
[
-2.183754259805564,
55.58706479201416
], ......
Link to the full document: https://www.dropbox.com/s/yul6ft2rweod75s/lakedistrictnationlpark.json?dl=0
I have not been able to format the SELECT query to return the name of the park successfully.
Can you help me? Is what I want to achieve possible? Any guidance would be appreciated.
You haven't mentioned what error you are getting; also double-check the spelling of "c.geometry" in the query you actually ran.
This should work
SELECT c.properties.npark18nm
FROM c
WHERE ST_WITHIN({"type": "Point", "coordinates": [-3.139638969259495,54.595188276959284]}, c.geometry)
When running the query with your sample document, I was able to get the correct response.
So this particular document is fine, and the query in your question works too. Can you recheck your query in the explorer? Also, are you referring to the wrong database/collection by any chance?
A full screenshot of Cosmos Data Explorer showing the dbs/collections, your query, and the response would also help.
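As a further sanity check (a sketch; "geometryCheck" is just an alias I chose), ST_ISVALIDDETAILED will report why any polygon is considered invalid:

SELECT c.properties.npark18nm, ST_ISVALIDDETAILED(c.geometry) AS geometryCheck
FROM c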
I have fixed this problem, not by altering the SQL statement, but by deleting the container the data was held in, recreating it, and reloading the data.
The SQL statement now produces the expected results.

Case insensitive search in arrays for CosmosDB / DocumentDB

Let's say I have these documents in my Cosmos DB (DocumentDB API, .NET SDK):
{
    // partition key of the collection
    "userId": "0000-0000-0000-0000",
    "emailAddresses": [
        "someaddress#somedomain.com",
        "Another.Address#someotherdomain.com"
    ]
    // some more fields
}
I now need to find out if I have a document for a given email address. However, I need the query to be case insensitive.
There are ways to search case-insensitively on a field (they do a full scan, however):
How to do a Case Insensitive search on Azure DocumentDb?
select * from json j where LOWER(j.name) = 'timbaktu'
e => e.Id.ToLower() == key.ToLower()
These do not work for arrays. Is there an alternative way? A user defined function looks like it could help.
I am mainly looking for a temporary low-effort solution to support the scenario (I have multiple collections like this). I probably need to switch to a data structure like this at some point:
{
    "userId": "0000-0000-0000-0000",
    // Option A
    "emailAddresses": [
        {
            "displayName": "someaddress#somedomain.com",
            "normalizedName": "someaddress#somedomain.com"
        },
        {
            "displayName": "Another.Address#someotherdomain.com",
            "normalizedName": "another.address#someotherdomain.com"
        }
    ],
    // Option B
    "emailAddressesNormalized": [
        "someaddress#somedomain.com",
        "another.address#someotherdomain.com"
    ]
}
Unfortunately, my production database already contains documents that would need to be updated to support the new structure.
My production collections contain only 100s of these items, so I am even tempted to just get all items and do the comparison in memory on the client.
If performance matters, you should consider one of the normalization solutions you proposed yourself in the question. Then you could index the normalized field and get results without doing a full scan.
If for some reason you really don't want to touch the documents, then perhaps the feature you are missing is a simple join.
An example query that does a case-insensitive search within the array (with a scan):
SELECT c FROM c
JOIN email IN c.emailAddresses
WHERE LOWER(email) = LOWER('ANOTHER.ADDRESS#someotherdomain.com')
You can find more examples about joining in "Getting started with SQL commands in Cosmos DB".
Note that the WHERE criterion in this example cannot use an index, so consider using it only alongside another, more selective (indexed) criterion.
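Regarding the user-defined function the question mentions: here is a minimal sketch (the name containsIgnoreCase is hypothetical) of a UDF you could register on the collection and call from SQL. It still scans, so the same indexing caveat applies:

function containsIgnoreCase(addresses, email) {
  // addresses: the array property; email: the value to search for
  if (!addresses || !email) return false;
  var needle = email.toLowerCase();
  for (var i = 0; i < addresses.length; i++) {
    if (addresses[i].toLowerCase() === needle) return true;
  }
  return false;
}

// Called from SQL as:
// SELECT * FROM c WHERE udf.containsIgnoreCase(c.emailAddresses, 'ANOTHER.ADDRESS#someotherdomain.com')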

What does getContext().getCollection() return in a Cosmos DB stored procedure?

I have written a simple stored procedure to query a collection and return a response, but when I execute it from a Node.js script I get a 400 error code and the following error message:
"PartitionKey extracted from document doesn't match the one specified in the header"
When the getContext().getCollection().getSelfLink() value is printed, I get dbs/3Mk0AA==/colls/3Mk0AOWMbw0=/, but my database and collection IDs are some other values.
Any kind of help will be much appreciated.
When you observe the documents you created in Azure Cosmos DB, you will see several system-generated properties in addition to the id you set.
You can find the official statement in "System vs. user-defined resources".
{
    "id": "1",
    "statusId": "new",
    "_rid": "duUuAN3LzQEIAAAAAAAAAA==",
    "_self": "dbs/duUuAA==/colls/duUuAN3LzQE=/docs/duUuAN3LzQEIAAAAAAAAAA==/",
    "_etag": "\"0400d4ee-0000-0000-0000-5a24ac3f0000\"",
    "_attachments": "attachments/",
    "_ts": 1512352831
}
The getContext().getCollection().getSelfLink() method returns the "_self" value, not the id value you set.
PartitionKey extracted from document doesn't match the one specified
in the header
This issue is likely because you set the partition key incorrectly.
Suppose your partition key is color, and there are two logical partitions, red and blue. The partition key you pass must be a value such as red or blue, not the property name color.
You could refer to a similar thread I answered before: "How to specify NONE partition key for deleting a document in Document DB java SDK?"
Hope this helps.
Yes, thanks for the help, guys!
In case anyone else tries this: passing the partition key when executing the stored procedure worked for me. Note that the partitionKey option takes the partition key value, not the field name. The code is:
client.executeStoredProcedure('/dbs/<database-id>/colls/<collection-id>/sprocs/<storedproc-id>', <input to the procedure (if any)>, { partitionKey: <partition-key-value> }, callback);
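For reference, a minimal sketch of what such a stored procedure body might look like (server-side JavaScript; this is an illustration, not the asker's actual procedure):

function queryAll() {
  var context = getContext();
  var collection = context.getCollection(); // note the parentheses: getCollection is a method
  var accepted = collection.queryDocuments(
    collection.getSelfLink(), // the _self link, e.g. dbs/.../colls/.../
    "SELECT * FROM c",
    function (err, documents) {
      if (err) throw err;
      context.getResponse().setBody(documents);
    }
  );
  if (!accepted) throw new Error("The query was not accepted by the server.");
}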

Aggregate query for IBM Cloudant, which is basically CouchDB

I am a contributor at http://airpollution.online/, an open environmental web platform built as open source, with IBM Cloudant as its database service.
The platform's architecture is such that we need to fetch the latest data for each air-pollution measurement device from a collection. Going by my experience with MongoDB, I wrote an aggregate query to fetch each device's latest data, based on the epoch-time key present in every document in the respective collection.
The sample aggregate query is:
db.collection("hourly_analysis").aggregate([
{
$sort: {
"time": -1,
"Id": -1
}
}, {
$project: {
"Id": 1,
"data": 1,
"_id": 0
}
}, {
$group: {
"_id": "$Id",
"data": {
"$last": "$$ROOT"
}
}
}
If anyone has ideas or suggestions on how to write the equivalent design documents in IBM Cloudant, please help me! Thanks!
P.S. We still have to open-source the backend for this project (it may take some time).
In CouchDB/Cloudant this is usually better done as a view than an ad-hoc query. It's a tricky one but try this:
- a map step that emits the device ID and timestamp as two parts of a composite key, plus the device reading as the value
- a reduce step that looks for the largest timestamp and returns both the biggest (most recent) timestamp and the reading that goes with it (both values are needed because when rereducing, we need to know the timestamp so we can compare them)
- the view with group_level set to 1 will give you the newest reading for each device.
In most cases in Cloudant you can use the built-in reduce functions, but here you need a custom reduce.
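A sketch of such a design document (the field names Id, time, and data are taken from your aggregate query; adjust them to your schema). The reduce keeps whichever value carries the largest timestamp, and the same code works for the rereduce phase because each value always carries its own time:

{
  "_id": "_design/latest",
  "views": {
    "newest_by_device": {
      "map": "function (doc) { if (doc.Id && doc.time) { emit([doc.Id, doc.time], { time: doc.time, data: doc.data }); } }",
      "reduce": "function (keys, values, rereduce) { var newest = values[0]; for (var i = 1; i < values.length; i++) { if (values[i].time > newest.time) { newest = values[i]; } } return newest; }"
    }
  }
}

Query it with GET /<db>/_design/latest/_view/newest_by_device?group_level=1 to get one row per device.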
(The way that I solved this problem previously was to copy incoming data into a "newest readings" store as well as writing it to a database in the normal way. This makes it very quick to access if you only ever want the newest reading.)

Populate Azure Data Factory dataset from query

I cannot find an answer via Google, MSDN (or other Microsoft) documentation, or SO.
In Azure Data Factory you can get data from a dataset by using a copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in the documentation are simple, single-table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName" = "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex SQL.
Is there a way to define a more complex query in a pipeline, one that includes joins and/or transformation logic, using a query rather than a stored procedure? I know that you can specify fields in a dataset, but I don't know how to get around the "tablename" property.
If there is a way, what would that method be?
Input is on-premises SQL Server; output is Azure SQL Database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
There's a way to move data from on-premises SQL Server to Azure SQL Database using Data Factory.
You can use the Copy Activity; for your case specifically, check this code sample: the GitHub link to the ADF Activity source.
Basically you need to create a Copy Activity whose typeProperties contain a SqlSource and a SqlSink, looking like this:
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also worth mentioning: you are not limited to SELECTs from tables or views; table-valued functions will work as well (see the sketch below).
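For instance, a hedged sketch of a source whose query joins two tables (the table and column names are illustrative, loosely based on the truncated example above, and the join condition is an assumption):

"source": {
    "type": "SqlSource",
    "SqlReaderQuery": "select infoNode.BuildId, infoType.Name from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on infoNode.TypeId = infoType.Id"
}

A table-valued function would go in the same place, e.g. "select * from dbo.SomeTableValuedFunction(1)" (hypothetical function name).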
