Spatial Indexes with DocumentDB - azure

I’m trying to do a spatial query against DocumentDB that looks like this:
SELECT * FROM root r WHERE
ST_WITHIN({'type':'Point','coordinates':[-122.02625, 37.4718]}, r.boundingBox)
to match a document that looks like this in the collection:
{
  "userId": "747941cfb829",
  "id": "747941cfb829_1453640096710",
  "boundingBox": {
    "type": "Polygon",
    "coordinates": [
      [-122.0263, 37.9718],
      [-122.0262, 37.9718],
      [-122.0262, 36.9718],
      [-122.0263, 36.9718],
      [-122.0263, 37.9718]
    ]
  },
  "distance": 0,
  "duration": 1
}
I’ve turned on spatial indexes per https://azure.microsoft.com/en-us/documentation/articles/documentdb-geospatial/, but I’m not getting a match back from DocumentDB.
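The spatial indexing policy that article describes looks roughly like this (the paths and data types here are illustrative, not copied from my actual collection):
{
  "indexingPolicy": {
    "includedPaths": [
      {
        "path": "/*",
        "indexes": [
          { "kind": "Range", "dataType": "Number" },
          { "kind": "Range", "dataType": "String" },
          { "kind": "Spatial", "dataType": "Point" },
          { "kind": "Spatial", "dataType": "Polygon" }
        ]
      }
    ]
  }
}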
Any ideas?
NOTE: Corrected GeoJSON coordinate order.

The correct specification of a GeoJSON polygon wraps the coordinates in one more array than you show, to allow for the possibility of holes and multi-polygons. So it would look like this:
{
  "type": "Polygon",
  "coordinates": [
    [
      [0, 0], [10, 10], [10, 0], [0, 0]
    ]
  ]
}
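Applied to the boundingBox from the question, keeping its coordinates as-is, that would be:
{
  "type": "Polygon",
  "coordinates": [
    [
      [-122.0263, 37.9718],
      [-122.0262, 37.9718],
      [-122.0262, 36.9718],
      [-122.0263, 36.9718],
      [-122.0263, 37.9718]
    ]
  ]
}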

Related

Substring filtering in Altair / using "params"

I am using Altair and would like to filter data using a substring search. Here is an example of doing it in Vega-Lite:
{
  "config": {"view": {"continuousWidth": 400, "continuousHeight": 300}},
  "data": {"name": "d"},
  "mark": "point",
  "encoding": {
    "x": {"type": "quantitative", "field": "xval", "scale": {"domain": [0, 4]}},
    "y": {"type": "quantitative", "field": "yval", "scale": {"domain": [1, 10]}}
  },
  "params": [{
    "name": "Letter", "value": "A",
    "bind": {"input": "select", "options": ["A", "B", "C", "D", "E", "F"]}
  }],
  "transform": [
    {"filter": "indexof(datum.info, Letter)>-1"}
  ],
  "datasets": {
    "d": [
      {"xval": 1, "yval": 7, "info": "A;B;D;E"},
      {"xval": 2, "yval": 2, "info": "A;C;E;F"},
      {"xval": 3, "yval": 9, "info": "A;B;D"}
    ]
  }
}
This allows me to filter out rows that contain "A", "B", "C", etc. in the info column, but it relies on "params", which is not available in Altair yet. Is there any other way of achieving this kind of "substring" filtering in Altair as of now? This is meant to be a minimal example; in my actual use case I have a large number of "options" (many gene names), so adding a column for each to the original data would not be feasible.
I am trying to do this in Altair because it is for an executable research article, which I believe allows Altair but not Vega-Lite.
Edit: I realized that indexing like infoSel.info[0] gives the string of the selection from the dropdown. This still worked with infoSel.info (with no index), but that was just lucky; in expressions like this, using infoSel.info[0] is more correct.
Got it! This is possible with an expression in transform_filter, which I had previously tried but done incorrectly (I was using the name of the dropdown, not the name of the selection object):
import altair as alt
import pandas as pd

d = pd.DataFrame({'xval': [1, 2, 3],
                  'yval': [7, 2, 9],
                  'info': ['A;B;D;E', 'A;C;E;F', 'B;D']})

info_dropdown = alt.binding_select(options=['A', 'B', 'C', 'D', 'E', 'F'], name='Letter')
info_sel = alt.selection_single(name='infoSel', fields=['info'], bind=info_dropdown, init={'info': 'A'})

# The dropdown value is exposed to Vega expressions as infoSel.info[0]
alt.Chart(d).mark_circle().encode(
    x='xval', y='yval'
).add_selection(info_sel).transform_filter('indexof(datum.info, infoSel.info[0])>-1')

Azure Form Recognizer Not Behaving As Expected

I am having an issue with Form Recognizer not behaving the way I have seen it should. Here is the dilemma:
I have an invoice that, when run through https://{endpoint}/formrecognizer/v2.0/layout/analyze, has its table recognized, and the proper JSON with the "tables" node is generated. Here is an example of part of it:
{
  "rows": 8,
  "columns": 8,
  "cells": [
    {
      "rowIndex": 0,
      "columnIndex": 4,
      "columnSpan": 3,
      "text": "% 123 F STREET Deer Park TX 71536",
      "boundingBox": [
        3.11,
        2.0733
      ],
      "elements": [
        "#/readResults/0/lines/20/words/0",
        "#/readResults/0/lines/20/words/1"
      ]
    }
When I train a model with NO labels file (https://{endpoint}/formrecognizer/v2.0/custom/models), it does not generate even an empty "tables" node; instead it generates tokens. Here is an example of the entry above without the "tables" node:
{
  "key": {
    "text": "__Tokens__12",
    "boundingBox": null,
    "elements": null
  },
  "value": {
    "text": "123 F STREET",
    "boundingBox": [
      5.3778,
      2.0625,
      6.8056,
      2.0625,
      6.8056,
      2.2014,
      5.3778,
      2.2014
    ],
    "elements": null
  },
  "confidence": 1.0
}
I am not sure exactly where this is not behaving as intended, but any insight would be appreciated!
If you train a model WITH label files and then call FR Analyze(), the FR service will call the Layout service, which returns tables in the "pageResults" section.
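As a rough sketch of that flow against the v2.0 REST API (the useLabelFile flag, the Ocp-Apim-Subscription-Key header, and the placeholder URLs below come from the documented API shape rather than from the question, so treat them as assumptions):

import requests

endpoint = "https://<endpoint>"  # Form Recognizer endpoint
headers = {"Ocp-Apim-Subscription-Key": "<key>", "Content-Type": "application/json"}

# Train WITH labels: useLabelFile tells the service to pick up the *.labels.json files
train = requests.post(
    f"{endpoint}/formrecognizer/v2.0/custom/models",
    headers=headers,
    json={"source": "<training-container-sas-url>", "useLabelFile": True},
)
model_url = train.headers["Location"]  # poll this URL until the model status is "ready"

# Analyze an invoice with the labeled model
analyze = requests.post(
    f"{model_url}/analyze",
    headers=headers,
    json={"source": "<invoice-url>"},
)
result_url = analyze.headers["Operation-Location"]
# ...poll result_url; tables appear under analyzeResult["pageResults"][i]["tables"]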

Azure Gremlin edge traversal suspiciously high (Out() step) RU cost

I have a weird issue where doing an out() operation over a few edges causes my RU cost to triple. I hope someone can help me shed light on why, and on what I can do to mitigate it.
I have a Graph in CosmosDB, where there are two types of vertex labels: "Profile" and "Score". Each profile has 0 or 1 score-vertices via a "ProfileHasAggregatedScore" edge. The partitionKey is the ID of the Profile.
If I make the following query, the RU cost currently is:
g.V().hasLabel('Profile').out('ProfileHasAggregatedScore')
>78 RU (8 scores found)
And for reference, the cost of getting all vertices of a type is:
g.V().hasLabel('Profile')
>28 RU (110 profiles found)
g.E().hasLabel('ProfileHasAggregatedScore')
>11 RU (8 edges found)
g.V().hasLabel('AggregatedRating')
>11 RU (8 scores found)
And the cost of a single one of the vertices or edges is:
g.V('aProfileId').hasLabel('Profile')
>4 RU (1 found)
g.E('anEdgeId')
> 7 RU
g.V('aRatingId')
> 3.5 RU
Can someone please help me understand why a traversal that only touches a few vertices and edges along the way (see the traversal at the bottom) is more expensive than searching for everything? And is there something I can do to prevent it? Adding a has-filter with the partitionKey (a sketch of that variant follows below) does not seem to help. It seems odd that finding 16 more elements (8 edges and 8 vertices) after finding the 110 vertices triples the cost of the operation.
(NB. With 1000 profiles, the cost of doing one traversal along an edge to the score node is 2200 RU. This seems high, considering the emphasis the Azure team puts on it being scalable.)
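For reference, the partition-key-scoped variant mentioned above looks roughly like this (the property name 'partitionKey' and the value are placeholders, not my actual schema):
g.V().has('Profile', 'partitionKey', 'aProfileId')
  .out('ProfileHasAggregatedScore')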
The execution profile of the traversal, in case it helps (it seems most of the time is spent finding the edges in the out() step):
[
  {
    "gremlin": "g.V().hasLabel('Profile').out('ProfileHasAggregatedScore').executionProfile()",
    "totalTime": 46,
    "metrics": [
      {
        "name": "GetVertices",
        "time": 13,
        "annotations": {
          "percentTime": 28.26
        },
        "counts": {
          "resultCount": 110
        },
        "storeOps": [
          {
            "fanoutFactor": 1,
            "count": 110,
            "size": 124649,
            "time": 2.47
          }
        ]
      },
      {
        "name": "GetEdges",
        "time": 26,
        "annotations": {
          "percentTime": 56.52
        },
        "counts": {
          "resultCount": 8
        },
        "storeOps": [
          {
            "fanoutFactor": 1,
            "count": 8,
            "size": 5200,
            "time": 6.22
          },
          {
            "fanoutFactor": 1,
            "count": 0,
            "size": 49,
            "time": 0.88
          }
        ]
      },
      {
        "name": "GetNeighborVertices",
        "time": 7,
        "annotations": {
          "percentTime": 15.22
        },
        "counts": {
          "resultCount": 8
        },
        "storeOps": [
          {
            "fanoutFactor": 1,
            "count": 8,
            "size": 6303,
            "time": 1.18
          }
        ]
      },
      {
        "name": "ProjectOperator",
        "time": 0,
        "annotations": {
          "percentTime": 0
        },
        "counts": {
          "resultCount": 8
        }
      }
    ]
  }
]

MongoDB $geoIntersects cannot find Polygons that contain a given Point

I have a document in the database that contains a "polygon" property which, as far as I can tell, is a valid GeoJSON object. I want to search the database using a GeoJSON Point object to find documents where the polygon in the polygon property contains the Point. To do so, I am using the $geoIntersects operator; however, whenever I perform the find, MongoDB returns the error: [Error: Can't use $geoIntersects].
The only object in the database:
{
  "_id": ObjectId("581540795fd2da1b188eb09c"),
  "name": "String",
  "polygon": {
    "coordinates": [
      [ -90, -180 ],
      [ 90, -180 ],
      [ 90, 180 ],
      [ -90, 180 ],
      [ -90, -180 ]
    ],
    "_id": ObjectId("581540795fd2da1b188eb09d"),
    "name": "String",
    "type": "Polygon"
  },
  "__v": 0
}
I am using mongoose to search the database. The query that is used to perform the search:
{
  polygon: {
    $geoIntersects: {
      $geometry: {
        type: 'Point',
        coordinates: [<long>, <lat>]
      }
    }
  }
}
If I set the latitude and longitude to something simple, say (0, 0) or (1, 1), it still returns the error. According to what I have read elsewhere, the only reason this error should be returned is that the documents in the database are not valid GeoJSON objects, but I cannot see anything wrong with the only object in the database.
I think you're missing the outer array on your coordinates for a GeoJSON polygon. Your document should look like this:
{
  "_id": ObjectId("581540795fd2da1b188eb09c"),
  "name": "String",
  "polygon": {
    "coordinates": [[
      [ -90, -180 ],
      [ 90, -180 ],
      [ 90, 180 ],
      [ -90, 180 ],
      [ -90, -180 ]
    ]],
    "_id": ObjectId("581540795fd2da1b188eb09d"),
    "name": "String",
    "type": "Polygon"
  },
  "__v": 0
}
From the GeoJSON spec (http://geojson.org/geojson-spec.html#id4):
Coordinates of a Polygon are an array of LinearRing coordinate arrays. The first element in the array represents the exterior ring. Any subsequent elements represent interior rings (or holes).

MongoDB update query for nested array

I have a collection Measurement containing documents such as the one shown below:
{
"Data" : [ [-5, [[1, 1023.0], [2, 694.0]]], [-1, [[1, 0.0], [2, 20.0]]], [-3, [[1, 30.75], [2, 30.75]]] ]
}
It reflects the C# structure Dictionary<int, Dictionary<int, double>>. What I need to do is write an update script which adds 5 to all the parent dictionary keys. How could this be done via a mongo update script? It should turn the object into the following:
{
"Data" : [ [0, [[1, 1023.0], [2, 694.0]]], [4, [[1, 0.0], [2, 20.0]]], [2, [[1, 30.75], [2, 30.75]]] ]
}
The only way to do this is programmatically, i.e., by looping over the Data array and updating each element individually.
This is probably not the structure that you really want if you need to update things in this way. The problem lies with matching elements in a nested array: the current limitation is that you can only match the first matching position, and you can reference only that index when doing an update.
We can't tell much about your purpose based on what you have presented, but what you probably need is something like this:
{
  "Data": [
    {
      "pos": 0,
      "ref": -5,
      "A": { "x": 1, "y": 1023.0 },
      "B": { "x": 2, "y": 694.0 }
    },
    {
      "pos": 1,
      "ref": -1,
      "A": { "x": 1, "y": 0.0 },
      "B": { "x": 2, "y": 20.0 }
    },
    {
      "pos": 2,
      "ref": -3,
      "A": { "x": 1, "y": 30.75 },
      "B": { "x": 2, "y": 30.75 }
    }
  ]
}
Yet even that does not allow you to update in a single query. You can do it with one query for each element though:
db.collection.update({"_id": id, "Data.pos": 0}, {"$inc":{"Data.$.ref": 5}});
db.collection.update({"_id": id, "Data.pos": 1}, {"$inc":{"Data.$.ref": 5}});
db.collection.update({"_id": id, "Data.pos": 2}, {"$inc":{"Data.$.ref": 5}});
Your current schema would not allow you to do even that. At least with this structure all of the elements can be accessed in that way, which they could not be before.
In any case, updating all of the array elements at once is not possible other than in a loop:
db.collection.find({ "_id": id }).forEach(function(doc) {
    // bump the "ref" key on every element of the Data array
    doc.Data.forEach(function(data) {
        data.ref += 5;
    });
    // write the whole modified array back
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "Data": doc.Data } }
    );
})
Or use some variant that might even do something like the first example, rather than just replacing the whole array as this does. Your current structure would rely on looping through several nested arrays to do the same thing.
Of course if you regularly have to update all elements in this way, then consider something other than an array. Or live with how you have to update, according to what your data access needs are.
Read the documentation on how things can be handled and make your decisions from there.

Resources