Questions on storing and searching geolocation data in Azure Storage

Hey everyone, I am building a website that needs to store the physical location of a person and allow someone to search for other people within some radius of that location. For the sake of an example, let's pretend it's a dating website (it's not), so as a user you want to find people within 50 miles of your current location who meet some other set of search criteria.
Currently I am storing all of my user information in Azure Table Storage. However, this is the first time I've ever attempted to create a geo-aware search algorithm, so I wanted to verify that I am not going to waste my time doing something totally insane. To that end, the idea for my implementation is as follows:
Store the longitude and latitude of the user's location in the Azure table
The PartitionKey for each entry is the state (or country if outside the US) that the person lives in
I want to calculate the distance between the current user and all other users using the haversine equation (a rough sketch of it is included below). I'm assuming I can embed this into my LINQ query?
Eventually I'd like to be able to get a list of all states/countries within the radius so I can optimize based on PartitionKeys
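For reference, here is a minimal sketch of the haversine calculation in C#; the class name, method name and the 3959-mile Earth radius are illustrative choices, not anything Azure-specific:
using System;

// Minimal haversine sketch: great-circle distance in miles between two points.
public static class GeoMath
{
    private const double EarthRadiusMiles = 3959.0;

    public static double HaversineDistance(double lat1, double lon1, double lat2, double lon2)
    {
        double dLat = ToRadians(lat2 - lat1);
        double dLon = ToRadians(lon2 - lon1);

        double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
                   Math.Cos(ToRadians(lat1)) * Math.Cos(ToRadians(lat2)) *
                   Math.Sin(dLon / 2) * Math.Sin(dLon / 2);

        double c = 2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a));
        return EarthRadiusMiles * c;
    }

    private static double ToRadians(double degrees)
    {
        return degrees * Math.PI / 180.0;
    }
}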
Naturally this implementation strategy leads to a few questions:
If I do the haversine calculation in a LINQ query, where is it executed? The last thing I want is for my WebRole to pull every record from Azure storage and run the haversine equation on each one within the application process
Is the idea of doing this with Azure Storage totally crazy? I know there are solutions like MongoDB that have built-in spatial search queries. I like the scalability of Azure, but is there some better alternative I should investigate instead?
Thanks for the help in advance!

If you want the query to be fast against Azure tables, the query has to run against the partition key and row key. Also, if you're using LINQ to query Azure tables, you need to be careful about which functions you use. If a value can be calculated once for all rows, LINQ will be clever and evaluate it before sending the query to Azure Table storage; however, if a function needs to be evaluated for each row, you can only use the ones that Azure supports (which is a pretty short list).
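As a rough illustration, assuming the WCF Data Services based table client (TableServiceContext) and a made-up UserEntity class:
// Evaluated once on the client: state.ToUpperInvariant() doesn't involve the row,
// so LINQ folds it into a constant before the request is sent. This is fine.
var ok = context.CreateQuery<UserEntity>("Users")
    .Where(u => u.PartitionKey == state.ToUpperInvariant());

// Needs evaluating per row: the table service can't run ToUpperInvariant(),
// so this query fails when the provider tries to translate it.
var notOk = context.CreateQuery<UserEntity>("Users")
    .Where(u => u.State.ToUpperInvariant() == "WA");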
You might be able to make this work if you use a bounding square for your area rather than a bounding circle. Store the latitude in the PartitionKey and the longitude in the RowKey, and have a query that looks like this:
var query = from UserLocation ul in repository.All()
            where ul.PartitionKey.CompareTo(minimumLatitude) > 0
               && ul.PartitionKey.CompareTo(maximumLatitude) < 0
               && ul.RowKey.CompareTo(minimumLongitude) > 0
               && ul.RowKey.CompareTo(maximumLongitude) < 0
            select ul;
This is probably not as clever as you were hoping for, though. If this is not going to work for you, then you'll need to look at other options. SQL Azure supports geospatial queries if you want to stay within the Microsoft family.
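One caveat with the query above: PartitionKey and RowKey are strings, so CompareTo only orders correctly if the coordinates are stored in a fixed-width, offset form, both when entities are written and when the bounds are computed. A sketch of how the bounds might be derived is below; the class name, the +90/+180 offsets and the 69-miles-per-degree approximation are my own choices:
using System;
using System.Globalization;

public static class GeoKeys
{
    // Rough miles per degree of latitude; a degree of longitude shrinks with cos(latitude).
    private const double MilesPerDegreeLat = 69.0;

    // Encode a coordinate as a fixed-width, zero-padded string so that
    // lexicographic comparison on PartitionKey/RowKey matches numeric order.
    // The offset (+90 for latitude, +180 for longitude) keeps the value positive.
    public static string ToKey(double value, double offset)
    {
        return (value + offset).ToString("000.000000", CultureInfo.InvariantCulture);
    }

    // Compute the bounding-square key range for a radius (in miles) around a point.
    public static void GetBoundingKeys(
        double lat, double lon, double radiusMiles,
        out string minimumLatitude, out string maximumLatitude,
        out string minimumLongitude, out string maximumLongitude)
    {
        double milesPerDegreeLon = MilesPerDegreeLat * Math.Cos(lat * Math.PI / 180.0);

        minimumLatitude = ToKey(lat - radiusMiles / MilesPerDegreeLat, 90);
        maximumLatitude = ToKey(lat + radiusMiles / MilesPerDegreeLat, 90);
        minimumLongitude = ToKey(lon - radiusMiles / milesPerDegreeLon, 180);
        maximumLongitude = ToKey(lon + radiusMiles / milesPerDegreeLon, 180);
    }
}
After running the PartitionKey/RowKey range query with these bounds, you can trim the square down to a true circle on the client with a haversine check over the (much smaller) result set.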

Having used spatial queries with SQL Azure (approximately 5 million records) I can confirm that it can be very quick - it may well be worth a look for you.

Related

Look ahead search on document fields in Azure DocumentDB

We are interested in using DocumentDb as a data store for a number of data sources and as such we are running a quick POC to establish whether it meets the criteria we are looking for.
One of the areas we are keen to provide is look-ahead search capabilities for certain fields. These are traditionally provided using the SQL LIKE syntax, which does not appear to be supported at present.
Searching online I have seen people talking about integrating Azure search but this appears to be a very costly mechanism for such a simple use case.
I have also seen people mention the use of UDFs, but this appears to require an entire collection scan, which is not practical from a performance perspective.
Does anyone have any alternative suggestions? One thing I considered was simply using a SQL table and initiating an update each time a document was inserted/updated/deleted.
DocumentDB supports STARTSWITH and range indexes to support prefix/look ahead searching.
You can progressively make queries like the following based on what your user types in a text box:
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "H")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hi")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hil")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hilton")
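If you're using the .NET SDK, the same thing can also be expressed through its LINQ provider; this is a sketch where the Hotel class, collectionLink and prefix variable are placeholders:
// Sketch: prefix search via the DocumentDB .NET SDK's LINQ provider.
// StartsWith translates to STARTSWITH and Take(10) translates to TOP 10.
var matches = client.CreateDocumentQuery<Hotel>(collectionLink)
    .Where(h => h.Name.StartsWith(prefix))
    .Take(10)
    .ToList();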
Note that you must configure the collection, or the path/property you're using for these queries, with a range index. You can extend this approach to handle additional cases as well:
To query in a case-insensitive manner, you must store the lower case form of the search property, and use that for querying.
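For instance, a sketch of what that collection configuration might look like with the .NET SDK; the collection id, the /name path and the nameLower property are illustrative:
// Sketch: create a collection with a string range index on /name so that
// STARTSWITH queries can use the index instead of scanning.
var collection = new DocumentCollection { Id = "hotels" };
collection.IndexingPolicy.IncludedPaths.Add(new IncludedPath
{
    Path = "/name/?",
    Indexes = new System.Collections.ObjectModel.Collection<Index>
    {
        new RangeIndex(DataType.String) { Precision = -1 }
    }
});
await client.CreateDocumentCollectionAsync(databaseLink, collection);

// For case-insensitive look-ahead, also store a lower-cased copy of the
// property (e.g. nameLower) and run STARTSWITH against that instead.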
I faced a similar situation, where a fast lookup was required, as a user typed search terms.
My scenario was that potentially thousands of simultaneous users would be performing such lookups; when testing this under load, to avoid saturation and throttling, we found we would have to increase the DocumentDB Request Unit (RU) throughput amount to a point that was not financially viable for us, in our specific circumstances.
We decided that DocumentDB was best used as the persistent store and for 'full' data retrieval - a role it performs exceptionally well - while a small ElasticSearch cluster performed the role it was designed for: text search, faceted search, weighted search, stemming and, most relevant to your question, autocomplete analyzers and completion suggesters.
The subject of type-ahead queries, index creation, autocomplete analyzers and query-time 'search as you type' in ElasticSearch is covered here, here and here.
The fact that you plan to have several data sources would also potentially make the ElasticSearch cluster approach more attractive, to aggregate search data.
I used the Bitnami template available in the Azure market place to create relatively small instances, and most importantly, this allowed me to place the cluster on the same Virtual Network as my other components, which greatly increased performance.
Cost was lower than Azure Search (which uses ElasticSearch under the hood).

Date function and Selecting top N queries in DocumentDB

I have following questions regarding Azure DocumentDB
According to this article, multiple functions have been added to DocumentDB. Is there any way to get date functions working? How can I get queries of the type "greater than some date" working?
Is there any way to select top N results like 'Select top 10 * from users'?
According to the Document playground, Order By will be supported in the future. Is there any other workaround for now?
The application that I am developing requires certain number of results to be displayed that have been inserted recently. I need these functionalities within a stored procedure. The documents that I am storing in DocumentDB have a DateTime property. I require the above mentioned functionalities for my application to work. I have searched at documentation and samples. Please help if you know of any workaround.
Some thoughts/suggestions below:
Please take a look at this idea on how to store and query dates in DocumentDB (as epoch timestamps). http://azure.microsoft.com/blog/2014/11/19/working-with-dates-in-azure-documentdb-4/
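A quick sketch of that idea; the Order class and its property names are made up:
// Sketch: store the date both as a readable value and as an epoch number,
// then run range comparisons against the numeric property.
public class Order
{
    public string Id { get; set; }
    public DateTime OrderDate { get; set; }      // human-readable
    public long OrderDateEpoch { get; set; }     // seconds since 1970-01-01 UTC, used for range queries
}

// "Greater than some date" becomes a numeric comparison:
long cutoff = (long)(someDate - new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)).TotalSeconds;
var recent = client.CreateDocumentQuery<Order>(collectionLink)
    .Where(o => o.OrderDateEpoch > cutoff)
    .ToList();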
To get top N results, set FeedOptions.MaxItemCount and read only one page, i.e., call ExecuteNextAsync() once. See https://msdn.microsoft.com/en-US/library/microsoft.azure.documents.linq.documentqueryable.asdocumentquery.aspx for an example. We're planning to add TOP to the grammar to make this easier in the future.
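A rough sketch of that pattern; client, collectionLink and the Order class are placeholders:
// Sketch: emulate TOP 10 by capping the page size and reading a single page.
var query = client.CreateDocumentQuery<Order>(
        collectionLink,
        new FeedOptions { MaxItemCount = 10 })
    .AsDocumentQuery();

FeedResponse<Order> firstPage = await query.ExecuteNextAsync<Order>();
// firstPage holds at most the first 10 documents; don't call ExecuteNextAsync
// again unless you actually want the next page.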
You can email me at arramac at microsoft dot com to get early access to Order By right away. This is planned for broad release shortly.
Please note that stored procedures are best used when you have write operations. You'll get better throughput on reads when you query directly.

Location-based data in Azure table storage

I am designing an app which has a feature that allows a user to store geolocation-based data, then later on allows other users to query for data that falls within a given radius of their current geolocation.
The question is: what is the best approach to designing the table so that it is scalable and has great performance? I was thinking of having a table containing latitude as the partition key (pk) and longitude as the row key (rk), then a dataid column that maps to another table that uses that dataid as its partition key.
My thinking is that using the two-way lookup is going to boost my performance, since both lookups would be instant. Is this the right way of thinking? I was reading somewhere that looking for partition keys that fall into some range is bad. If that is the case, how should I approach this? How do Google Maps or Apple's MapKit typically implement this feature?
This is difficult to answer completely without knowing your access patterns and scenarios fully. You can use the recently published Azure Storage Table design guide to help come up with a good design for your problem
http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
For reads, Azure Table Storage is designed for fast point queries where the client knows the partition key and row key, so you need to factor that into your data model. For writes, you want a uniform distribution and to avoid an append/prepend pattern in order to achieve high scale.
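To make the "point query" part concrete, a sketch using the storage SDK; the DataEntity class, the key values and the dataTable reference are placeholders:
// Sketch: a point query - both keys known - is a single-entity lookup.
TableOperation retrieve = TableOperation.Retrieve<DataEntity>("47.612345", "-122.334567");
TableResult result = await dataTable.ExecuteAsync(retrieve);
var entity = (DataEntity)result.Result;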

About Azure Table Secondary Index

I know secondary indexes are not here yet: they're on the wish list and "planned".
I'd like to get some ideas (or information from a reliable source) about the incoming secondary indexes.
1st question: I noticed MS planned "secondary indexes": does that mean we can create as many indexes as we want on one table?
2nd question: The current index is "PartitionKey+RowKey". If the above is not true, will the secondary index be "RowKey+PartitionKey", or is there a good chance we can customize it?
I'd like to gather some ideas because I am currently designing a table; since there won't be much data at the beginning, I think I can wait for the secondary index feature instead of creating multiple tables at this moment.
Please share your ideas or any sources you have, thanks.
There's currently no information on secondary indexes, other than what's written at the site you referenced. So, there's no way to answer either of your two questions.
Several customers I work with, that use Table Storage, have taken the multiple-table approach to provide additional indexing. For those requiring extensive index coverage, that data typically has found its way into SQL Azure (or a combination of SQL Azure + Table Storage).
As a Windows Azure MVP I don't have any information about secondary indexes in the table service. If we do need more indexes in the table service but don't want to use SQL Azure (not just because of the pricing...), then I would de-normalize my data, splitting the same data into more than one table with a different row key acting as the index.
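To illustrate that multiple-table pattern, here's a sketch using the storage SDK; the table references, the User class and the key choices are all made up:
// Sketch: write the same user into two tables whose keys are chosen for the
// lookup each table needs to serve, so both lookups stay fast point queries.
var byState = new DynamicTableEntity(user.State, user.Id);
byState.Properties["Email"] = new EntityProperty(user.Email);

var byEmail = new DynamicTableEntity(user.Email, user.Id);
byEmail.Properties["State"] = new EntityProperty(user.State);

await usersByStateTable.ExecuteAsync(TableOperation.InsertOrReplace(byState));
await usersByEmailTable.ExecuteAsync(TableOperation.InsertOrReplace(byEmail));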
This question is now two years old. And still no sign of secondary indexes in Azure Table Storage. My guess is that it is now very unlikely to ever eventuate.
Azure Cosmos DB provides the Table API for applications that are written for Azure Table storage and that need capabilities like:
Automatic Secondary Indexing
From: Introduction to Azure Cosmos DB: Table API
So if you are willing to move your tables over to Cosmos, then you will get all the indexing you could ever want.

Efficiently Search Nearest Geographic Locations

I searched on SO and didn't really find an answer to this but it seems like a common problem.
I have several hundred thousand locations in a database, each having the geocode (lat/long). If it matters, they are spread out across the U.S. Now, I have a client app in which I want users to give me their lat/long and a radius (say 5mi, 10mi, 25mi, etc.) and I want to return all the records that match. I only care about the distance value that can be obtained via, say, the Haversine formula, not the shortest road distance. However, given that, I want it to be as accurate as possible.
This database is mostly read-only. On a good day, there might be 10 inserts. Now, I will have hundreds of clients, maybe tens of thousands of clients that will be using the software. I want users to get results in a few seconds but if a single query takes 10-20 seconds, then it will crawl when hit with a load of clients.
How do I serve up results as efficiently as possible? I know I could just store them in MySQL or PostgreSQL (Oracle and MS SQL Server are out for this, but some other open source data store may be fine) and just put the Haversine formula in the WHERE clause, but I don't think that is going to yield efficient results.
PostgreSQL supports a wide range of geospatial queries, so long as it's got the PostGIS extensions installed. Nearest or radius or bounding box searches are particularly easy.
I have used Solr (a Lucene-based search server) for radius search. We have written a property portal that lets users search for properties based on radius.
We index the database, so the search will be ultra fast.
I'm starting to feel like the official spokesman for Sphinx. This article explains how to configure it for geospatial searching: http://www.god-object.com/2009/10/20/geospatial-search-using-sphinx-search-and-php/
To clarify: the data will be stored in mysql/postgres but indexed by and searched via Sphinx.
