Cannot identify the correct Cosmos DB SQL SELECT syntax to check if coordinates (Point) are within a Polygon - Azure

I'm developing an app that uses Cosmos DB SQL. The intention is to identify whether a potential construction site is within various restricted zones, such as national parks and sites of special scientific interest. This information is very useful in obtaining all the appropriate planning permissions.
I have created a container named 'geodata' containing 15 documents that I imported using data from a GeoJSON file provided by the UK National Parks. I have confirmed that all the polygons are valid using an ST_ISVALIDDETAILED SQL statement. I have also checked that the coordinates are anti-clockwise. A few documents contain MultiPolygons. The Geospatial Configuration of the container is 'Geography'.
I am using the Azure Cosmos Data Explorer to work out the correct format of a SELECT statement that determines whether given coordinates (a Point) fall within any of the polygons in the 15 documents.
SELECT c.properties.npark18nm
FROM c
WHERE ST_WITHIN({"type": "Point", "coordinates":[-3.139638969259495,54.595188276959284]}, c.geometry)
The embedded coordinates are within a National Park, in this case, the Lake District in the UK (it also happens to be my favourite coffee haunt).
'c.geometry' is the JSON field within the documents.
"type": "Feature",
"properties": {
"objectid": 3,
"npark18cd": "E26000004",
"npark18nm": "Northumberland National Park",
"npark18nmw": " ",
"bng_e": 385044,
"bng_n": 600169,
"long": -2.2370801,
"lat": 55.29539871,
"st_areashape": 1050982397.6985701,
"st_lengthshape": 339810.592994494
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-2.182235310191206,
55.586659699934806
],
[
-2.183754259805564,
55.58706479201416
], ......
Link to the full document: https://www.dropbox.com/s/yul6ft2rweod75s/lakedistrictnationlpark.json?dl=0
I have not been able to format the SELECT query to return the name of the park successfully.
Can you help me?
Is what I want to achieve possible?
Any guidance would be appreciated.

You haven't mentioned what error you are getting. And you have misspelt "c.geometry".
This should work
SELECT c.properties.npark18nm
FROM c
WHERE ST_WITHIN({"type": "Point", "coordinates": [-3.139638969259495,54.595188276959284]}, c.geometry)
When running the query with your sample document, I was able to get the correct response.
So this particular document is fine and the query in your question works too. Can you recheck your query in the explorer? Also, are you pointing at the wrong database/collection by any chance?
A full screenshot of the Cosmos Data Explorer showing the databases/collections, your query, and the response would also help.
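If it helps to rule out the Data Explorer itself, below is a minimal sketch of running the same query from Node.js. It assumes the @azure/cosmos v3 SDK and placeholder account, key, and database names, none of which come from your post.
const { CosmosClient } = require("@azure/cosmos");

async function findContainingParks() {
  // Placeholder connection details -- substitute your own account values
  const client = new CosmosClient({ endpoint: "https://<your-account>.documents.azure.com", key: "<your-key>" });
  const container = client.database("<your-database>").container("geodata");

  // The same ST_WITHIN query shown above, with the point embedded as a literal
  const query = 'SELECT c.properties.npark18nm FROM c WHERE ST_WITHIN({"type": "Point", "coordinates": [-3.139638969259495, 54.595188276959284]}, c.geometry)';

  const { resources } = await container.items.query(query).fetchAll();
  console.log(resources); // expect the containing park's npark18nm for the matching document
}

findContainingParks().catch(console.error);
If this returns an empty array against the same container, that points at the data or the container's geospatial configuration rather than the query syntax.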

I have fixed this problem, not by altering the SQL statement, but by deleting the container the data was held in, recreating it, and reloading the data.
The SQL statement now produces the expected results.

Related

Azure RUs on Cosmos DB

I am trying to find out how RUs work in order to optimize the requests made to the DB.
I have a simple query where I select by id:
SELECT * FROM c WHERE c.id='cl0'
That query costs 277.08 RUs
Then I have another query where I select by another property:
SELECT * FROM c WHERE c.name[0].id='35bfea78-ccda-4cc5-9539-bd7ff1dd474b'
That query costs 2.95 RUs
I can't figure out why there is such a big difference in the consumed RUs between these two queries.
The two queries return the exact same result:
[
{
"label": "class",
"id": "cl0",
"_id": "cl0",
"name": [
{
"_value": "C0.Iklos0",
"id": "35bfea78-ccda-4cc5-9539-bd7ff1dd474b"
}
],
"_rid": "6Ds6AJHyfgBfAAAAADFT==",
"_self": "dbs/6Ds4FA==/colls/6Ds6DFewfgA=/docs/6Ds6AJHyfgBdESFAAAAAAA==/",
"_etag": "\"00007200-0000-0000-0000-w3we73140000\"",
"_attachments": "attachments/",
"_ts": 1528722196
}
]
I faced a similar issue previously, so you are not the only person facing this. I can offer two suggestions.
1. SELECT * FROM c WHERE c.id='cl0' queries documents across the entire database. If you make a suitable field the partition key, it will greatly improve your performance (see the point-read sketch after the edit below).
You could refer to this doc to learn how to choose a partition key.
2. I found the answer below in the thread: Azure DocumentDB Query by Id is very slow
Microsoft support responded and they've resolved the issue. They've added IndexVersion 2 for the collection. Unfortunately, it is not yet available from the portal, and newly created accounts/collections are still not using the new version. You'll have to contact Microsoft Support to make changes to your accounts.
I suggest submitting feedback here to track this announcement.
Hope it helps you.
-- Edit
To upgrade to index version 2, use the following code:
// Read the existing collection definition (older .NET DocumentDB SDK)
var collection = (await client.ReadDocumentCollectionAsync(string.Format("/dbs/{0}/colls/{1}", databaseId, collectionId))).Resource;
// Set the new index version and replace the collection definition
collection.SetPropertyValue("IndexVersion", 2);
var replacedCollection = await client.ReplaceDocumentCollectionAsync(collection);
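As a rough illustration of point 1 (my own sketch, not part of the original answer): once the collection is partitioned on a suitable field, the cheapest way to fetch a single document is a point read by id plus partition key value rather than a query. Using the newer @azure/cosmos JavaScript SDK with placeholder names:
const { CosmosClient } = require("@azure/cosmos");

async function readById() {
  const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" });
  const container = client.database("<database>").container("<collection>");

  // Point read: id plus the document's partition key value; typically about 1 RU for a 1 KB document
  const { resource, requestCharge } = await container.item("cl0", "<partition-key-value>").read();
  console.log(requestCharge, resource && resource.id);
}

readById().catch(console.error);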
RU consumption depends on your document size and your query. I would highly recommend the link below on query metrics. If you want to tune your query or understand latency, check the query feed details:
x-ms-documentdb-query-metrics:
totalExecutionTimeInMs=33.67;queryCompileTimeInMs=0.06;queryLogicalPlanBuildTimeInMs=0.02;queryPhysicalPlanBuildTimeInMs=0.10;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=32.56;indexLookupTimeInMs=0.36;documentLoadTimeInMs=9.58;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=2000;retrievedDocumentSize=1125600;outputDocumentCount=2000;writeOutputTimeInMs=18.10;indexUtilizationRatio=1.00
x-ms-request-charge: 604.42
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-metrics
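If you want to capture those metrics programmatically rather than from the portal, here is a rough sketch using the @azure/cosmos v3 JavaScript SDK; the populateQueryMetrics option and header name are taken from the SDK/REST documentation, not from this answer.
const { CosmosClient } = require("@azure/cosmos");

async function showQueryMetrics() {
  const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" });
  const container = client.database("<database>").container("<collection>");

  // Ask the service to include per-page query metrics, then read them off the response
  const iterator = container.items.query("SELECT * FROM c WHERE c.id = 'cl0'", { populateQueryMetrics: true });
  const page = await iterator.fetchNext();

  console.log("RU charge:", page.requestCharge);
  console.log("Metrics:", page.headers["x-ms-documentdb-query-metrics"]);
}

showQueryMetrics().catch(console.error);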

What does getContext().getCollection() return in a Cosmos DB stored procedure?

I have written a simple stored procedure to query a collection and return the response, but when I execute it from a Node.js script I get a 400 error code and the following error message:
"PartitionKey extracted from document doesn't match the one specified in the header"
When the getContext().getCollection().getSelfLink() value is printed I get dbs/3Mk0AA==/colls/3Mk0AOWMbw0=/, but my database and collection IDs are different values.
Any kind of help will be much appreciated.
When you observe the documents you created in Azure Cosmos DB, you will see several system-generated properties in addition to the ID you set.
You can find the official statement in System vs. user defined resources.
{
"id": "1",
"statusId": "new",
"_rid": "duUuAN3LzQEIAAAAAAAAAA==",
"_self": "dbs/duUuAA==/colls/duUuAN3LzQE=/docs/duUuAN3LzQEIAAAAAAAAAA==/",
"_etag": "\"0400d4ee-0000-0000-0000-5a24ac3f0000\"",
"_attachments": "attachments/",
"_ts": 1512352831
}
The getContext().getCollection().getSelfLink() method returns the "_self" value, not the ID value you set.
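For illustration only (a minimal sketch, not your actual procedure), a stored procedure that logs the collection self link and queries the collection looks roughly like this:
function sampleSproc() {
  var context = getContext();
  var collection = context.getCollection();
  var response = context.getResponse();

  // getSelfLink() returns the system-generated "_self" link, e.g. dbs/3Mk0AA==/colls/3Mk0AOWMbw0=/
  var selfLink = collection.getSelfLink();

  var accepted = collection.queryDocuments(selfLink, "SELECT * FROM c", {}, function (err, docs) {
    if (err) throw err;
    response.setBody({ selfLink: selfLink, count: docs.length });
  });

  if (!accepted) throw new Error("The query was not accepted by the server.");
}
Whatever IDs you chose for the database and collection, the self link printed here will always be in that _rid-based format.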
PartitionKey extracted from document doesn't match the one specified in the header
This issue is likely because you set the PartitionKey incorrectly.
Suppose your partition key is color and there are two partition values, red and blue, in the database. The PK should be set to red or blue, not color.
You could refer to a similar thread I answered before : How to specify NONE partition key for deleting a document in Document DB java SDK?
Hope it helps you.
Yes, Thanks for the help guys!!
In case anyone else tries this: adding the partition key while executing the stored procedure worked for me. The code is:
client.executeStoredProcedure('/dbs/<database-id>/colls/<collection-id>/sprocs/<storedproc-id>', <input to the procedure (if any)>, { partitionKey: <partition-key-value> }, callback);

Aggregate query for IBM Cloudant, which is basically CouchDB

I am a contributor at http://airpollution.online/, which is an open-source environmental web platform that uses IBM Cloudant as its database service.
The platform's architecture is such that we need to fetch the latest data for each air pollution measurement device from a collection. Going by my experience with MongoDB, I wrote an aggregate query to fetch each device's latest data according to the epoch time key present in every document in the respective collection.
A sample aggregate query is:
db.collection("hourly_analysis").aggregate([
{
$sort: {
"time": -1,
"Id": -1
}
}, {
$project: {
"Id": 1,
"data": 1,
"_id": 0
}
}, {
$group: {
"_id": "$Id",
"data": {
"$last": "$$ROOT"
}
}
}
])
If someone has ideas/suggestions about how I can write the design documents in IBM Cloudant, please help me! Thanks!
P.S. We still have to open-source the backend for this project (this may take some time).
In CouchDB/Cloudant this is usually better done as a view than an ad-hoc query. It's a tricky one but try this:
- a map step that emits the device ID and timestamp as two parts of a composite key, plus the device reading as the value
- a reduce step that looks for the largest timestamp and returns both the biggest (most recent) timestamp and the reading that goes with it (both values are needed because when rereducing, we need to know the timestamp so we can compare them)
- the view with group_level set to 1 will give you the newest reading for each device.
In most cases in Cloudant you can use the built-in reduce functions, but here you need to write your own reduce function.
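For example, a design document along these lines (a sketch only; the Id, time and data field names are assumptions taken from your MongoDB pipeline) gives the newest reading per device:
// map: composite key of [device ID, timestamp]; the value carries the timestamp and the reading
function (doc) {
  if (doc.Id && doc.time) {
    emit([doc.Id, doc.time], { time: doc.time, data: doc.data });
  }
}

// reduce: keep whichever value has the largest (most recent) timestamp;
// this also works for the rereduce case because every value carries its own timestamp
function (keys, values, rereduce) {
  var newest = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].time > newest.time) {
      newest = values[i];
    }
  }
  return newest;
}
Querying the view with ?group_level=1 then returns one row per device ID whose value is that device's most recent reading.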
(The way that I solved this problem previously was to copy incoming data into a "newest readings" store as well as writing it to a database in the normal way. This makes it very quick to access if you only ever want the newest reading.)

How to back up/dump the structure of graphs in ArangoDB

Is there a way to dump the graph structure of an ArangoDB database? arangodump unfortunately just dumps the data of edges and collections.
According to the documentation, in order to dump structural information for all collections (including system collections) you run the following:
arangodump --dump-data false --include-system-collections true --output-directory "dump"
If you do not want the system collections to be included then don't provide the argument (it defaults to false) or provide a false value.
How the structure and data of collections are dumped is described below, from the documentation:
Structural information for a collection will be saved in files with the name pattern <collection-name>.structure.json. Each structure file will contain a JSON object with these attributes:
parameters: contains the collection properties
indexes: contains the collection indexes
Document data for a collection will be saved in files with the name pattern <collection-name>.data.json. Each line in a data file is a document insertion/update or deletion marker, along with some meta data.
For testing I often want to extract a subgraph with a known structure. I use that to test my queries against. The method is not pretty but it might address your question. I blogged about it here.
Although @Raf's answer is accepted, --dump-data false will only give structure files for all the collections, and the data won't be there. Including --include-system-collections true would give the _graphs system collection's structure, which won't have information pertaining to individual graphs' creation/structure.
To get the graph creation data as well, the right command is as follows:
arangodump --server.database <DB_NAME> --include-system-collections true --output-directory <YOUR_DIRECTORY>
We would be interested in the file named _graphs_<long_id>.data.json, which has the data format below:
{
"type": 2300,
"data":
{
"_id": "_graphs/social",
"_key": "social",
"_rev": "_WaJxhIO--_",
"edgeDefinitions": [
{
"collection": "relation",
"from": ["female", "male"],
"to": ["female", "male"]
}
],
"numberOfShards": 1,
"orphanCollections": [],
"replicationFactor": 1
}
}
Hope this helps other users who were looking for my requirement!
Currently ArangoDB manages graphs via documents in the system collection _graphs.
One document equals one graph. It contains the graph name, the involved vertex collections, and the edge definitions that configure the directions of the edge collections.

Populate Azure Data Factory dataset from query

I cannot find an answer via Google, MSDN (and other Microsoft) documentation, or SO.
In Azure Data Factory you can get data from a dataset by using a copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in the documentation are simple, single-table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName" = "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex SQL.
Is there a way to define a more complex query in a pipeline, one that includes joins and/or transformation logic that alters the data, from a query rather than a stored procedure? I know that you can specify fields in a dataset, but I don't know how to get around the "TableName" property.
If there is a way, what would that method be?
The input is an on-premises SQL Server; the output is an Azure SQL database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
There's a way to move data from on-premises SQL to Azure SQL using Data Factory.
You can use the Copy Activity; check this code sample for your case specifically: GitHub link to the ADF Activity source.
Basically, you need to create a Copy Activity whose typeProperties contain SqlSource and SqlSink sections that look like this:
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also worth mentioning: you can use not only selects from tables or views, but table-valued functions will work as well.
