I am trying to create a location-based app where people can query the records created within 5 miles of their location.
When the record is created, I will store the Latitude and Longitude in the Azure Table Service.
Once I have this data, how do I fetch all the records within 5 miles from my current location?
Thank you.
For Azure Table Storage queries to be optimized, they need to run on the PartitionKey and the RowKey. A solution could be to store the latitude in the PartitionKey and the longitude in the RowKey. The PartitionKey and RowKey combination needs to be unique (think primary key in SQL). I would use this strategy, and if you have multiple entries for the same latitude and longitude you could use ATS's dynamic properties or InsertOrMerge to store them in the same row. Keep in mind both keys are strings, so the coordinate values need to be stored in a lexicographically sortable form. That way you could query like this:
// minLatitude/maxLatitude and minLongitude/maxLongitude are the bounding-box
// values encoded as sortable strings, since PartitionKey and RowKey are strings.
IQueryable<Entries> query =
    from q in _table.CreateQuery<Entries>()
    where q.PartitionKey.CompareTo(minLatitude) > 0
       && q.PartitionKey.CompareTo(maxLatitude) < 0
       && q.RowKey.CompareTo(minLongitude) > 0
       && q.RowKey.CompareTo(maxLongitude) < 0
    select q;
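Because the keys are strings, the bounding-box comparison only works if latitude and longitude are encoded as fixed-width, lexicographically sortable strings. A minimal sketch of one such encoding (the +90/+180 offsets, the micro-degree precision and the helper names are illustrative assumptions, not part of the original answer):

// Shift into a positive range and zero-pad so string order matches numeric order.
static string EncodeLatitude(double latitude)
{
    // latitude in [-90, 90] -> micro-degrees in [0, 180000000], padded to 10 digits
    return ((long)((latitude + 90) * 1000000)).ToString("D10");
}

static string EncodeLongitude(double longitude)
{
    // longitude in [-180, 180] -> micro-degrees in [0, 360000000], padded to 10 digits
    return ((long)((longitude + 180) * 1000000)).ToString("D10");
}

// Example: roughly a 5-mile band of latitude around the current position.
string minLatitude = EncodeLatitude(currentLatitude - 0.073);
string maxLatitude = EncodeLatitude(currentLatitude + 0.073);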
You could also get clever with the PartitionKey and use it to store a range of latitudes, or regions, in order to limit the number of partitions needed. SQL Azure also supports geospatial queries.
I've just been doing work in this area and found that Azure Search will provide geospatial searches over Azure Table Storage. One drawback is that the smallest scale beyond the developer sandbox is $200 per month - well worth the money for a commercial venture but rather high for small operations.
In order to make this work I needed to duplicate the Latitude and Longitude fields into the GeoJSON format, i.e.
{"type": "Point", "coordinates": [102.0, 0.5]}
The free developer search option will allow one datasource based on partition key. For the purposes of my testing I have a table with everything in the same partition and unique RowKeys. I indexed the RowKey and the GeoJson value and found it works very well to search for all records within a radius of a given point.
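For reference, the radius search itself is expressed as an OData geo.distance filter against the geo-point field in the index, along these lines (the field name Location is an assumption about your index; the distance is in kilometres, so 5 miles is roughly 8.05 km, and POINT takes longitude before latitude):

// Filter passed to Azure Search, e.g. via the $filter parameter.
string filter = "geo.distance(Location, geography'POINT(-6.26 53.35)') le 8.05";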
While this is great, I think there are other storage solutions that will work better. DocumentDB and SQL Azure both support geospatial queries and, given the combined cost of storage and search, the cost of these alternatives is attractive.
I need to store coordinates in Azure and had intended on using table storage.
My idea was to be able to query for a subset of coordinates based on two coordinates, e.g.:
So my query (I think) would be, give me all the points where
The latitude is less than 53.360238 and greater than 53.344204
The Longitude is greater than -6.276734 and less than -6.250122
I had originally thought about saving them as:
PartitionKey, RowKey
"16.775833,-3.009444", "Timbuktu"
...
But I realised I would end up with thousands of partitions. I assumed that this would be really bad for doing a query, as I would have to touch many partitions, possibly on different storage nodes.
Also I'm not sure how it would work, given that a partition/row query is a string comparison.
I was wondering if there was a better way to store the points, for example I was thinking something like:
PartitionKey, RowKey, Title
16.775833,-3.009444, "Timbuktu"
...
This makes the query easier but doesn't solve the unique partition problem, e.g.:
Get all entities where the partition key is less than X and greater than Y, AND the RowKey is greater than A and smaller than B.
Is there a more efficient way to do this, perhaps by saving the whole number of the latitude as the partition key and the remainder in the RowKey?
PartitionKey, RowKey, Title
16, 775833^-3.009444, "Timbuktu"
...
Any advice is appreciated!
My suggestion would be to use DocumentDB to store this kind of unstructured data; you can easily write SQL-like queries on more than one field.
Table storage is built more for key-value lookups only.
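To give a rough idea of what that looks like (the collection and property names are assumptions, not from the question), DocumentDB's SQL dialect has a built-in ST_DISTANCE function that works on GeoJSON points and returns metres:

// All documents within ~5 miles (8047 m) of a point; c.location is an assumed GeoJSON Point property.
string sql = "SELECT * FROM c WHERE ST_DISTANCE(c.location, {'type': 'Point', 'coordinates': [-6.26, 53.35]}) < 8047";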
I have many questions on whether to store my data in SQL or Table Storage, and the best way to store it for efficiency.
Use Case:
I have around 5 million rows of objects that are currently stored in a MySQL database. Currently the metadata (Lat, Long, ID, Timestamp) is stored only in the database. The other 150 columns about the object that are not used much were moved into Table Storage.
In Table Storage, should each object be stored as a single row, with the 150 rarely used columns combined into one column, instead of as multiple rows?
For each of these 5 million objects in the database, there is certain information about them (temperature readings, trajectories, etc.). The trajectory data used to be stored in SQL (~300 rows per object) but was moved to Table Storage to be cost effective. Currently it is stored in Table Storage in a relational manner, where each row looks like (PK: ID, RK: ID-Depth-Date, X, Y, Z).
Currently it takes a long time to grab much of the trajectory data; Table Storage seems to be pretty slow in our case, and I want to improve the performance of the gets. Should the data be stored so that each object has one row for its trajectory, with all the XYZs stored in one column in JSON format? Instead of 300 rows to get, it would only need to get 1 row.
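For illustration, a hedged sketch of what such a collapsed, one-row-per-object trajectory entity could look like (the entity/property names and the use of Json.NET are my own assumptions; a single string property is limited to 64 KB and an entity to 1 MB, which should be fine for ~300 points):

public class TrajectoryPoint { public double Depth { get; set; } public double X { get; set; } public double Y { get; set; } public double Z { get; set; } }

public class TrajectoryEntity : TableEntity
{
    // All ~300 points serialized into one property, retrieved with a single point query.
    public string PointsJson { get; set; }
}

// Writing: PartitionKey = object ID, RowKey = a constant such as "trajectory".
var entity = new TrajectoryEntity
{
    PartitionKey = objectId,
    RowKey = "trajectory",
    PointsJson = JsonConvert.SerializeObject(points)   // points is a List<TrajectoryPoint>
};
table.Execute(TableOperation.InsertOrReplace(entity));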
Is Table Storage the best place to store all of this data? If I wanted to get an X,Y,Z at a certain measured depth, I would have to get the whole row and parse through the JSON. This is probably a trade-off.
Is it feasible to have the trajectory data, readings, etc. in a SQL database, where there can be (5,000,000 x 300 rows) for the trajectory data? There is also some information about the objects where it can be (5,000,000 x 20,000 rows). This is probably too much for a SQL database and would have to be in Azure cloud storage. If so, would the JSON option be the best one? The trade-off is that if I want a portion, say 1000 rows, I would have to get the whole table; however, isn't that faster than querying through 20,000 rows? I could probably split the data into sets of 1000 rows and use SQL as metadata for finding out which sets of data I need from the cloud storage.
Pretty much I'm having trouble understanding how to group data and format it into Azure Cloud Tables to be efficient and fast when grabbing data for my application.
Here's an example of my data and how I am getting it: http://pastebin.com/CAyH4kHu
As an alternative to table storage, you can consider using Azure SQL DB Elastic Scale to spread trajectory data (and associated object metadata) across multiple Azure SQL DBs. This allows you to overcome capacity (and compute) limits of a single database. You would be able to perform object-specific queries or inserts efficiently, and have options to perform queries across multiple databases -- assuming you are working with a .Net application tier. You can find out more by looking at http://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-get-started/
We've got a Windows Azure Table Storage system where various entity types report values during the day, so we've got the following partition and row key scenario:
There are about 4000-5000 entities. There are 6 entity types and the types are roughly evenly distributed, so around 800 each.
PartitionKey: entityType-Date
Row key: entityId
Each row records the values for an entity for that particular day. This is currently JSON serialized.
The data is quite verbose.
We will periodically want to look back at the values in these partitions over a month or two months depending on what our website users want to look at.
We are having a problem in that if we want to query a month of data for one entity we find that we have to query 31 partition keys by entityId.
This is very slow initially but after the first call the result is cached.
Unfortunately the nature of the site is that there will be a varying number of different queries so it's unlikely the data will benefit much from caching.
We could obviously make the partitions bigger, i.e. perhaps a whole week of data, and expand the RowKeys to entityId and date.
What other options are open to me, or is it simply the case that Windows Azure tables suffer fairly high latency?
Some options include
Make the 31 queries in parallel
Make a single query on a partition key range, that is
Partition key >= entityType-StartDate and Partition key <= entityType-EndDate and Row key = entityId.
It is possible that, depending on your data, this query may have less latency than your current approach; a rough sketch follows below.
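As a sketch of that second option with the 2.x storage client (the entity class, the variable names and the exact date format in the PartitionKey are assumptions; the date portion must be lexicographically sortable, e.g. yyyyMMdd, for the range to work):

// One query spanning all of an entity type's partitions for a date range.
string lower = entityType + "-" + startDate.ToString("yyyyMMdd");
string upper = entityType + "-" + endDate.ToString("yyyyMMdd");

var query = new TableQuery<DailyValuesEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, lower),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThanOrEqual, upper)),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, entityId)));

foreach (DailyValuesEntity day in table.ExecuteQuery(query)) { /* aggregate the month here */ }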
I am new to Azure Tables and have read a lot of articles, but would like some reassurance on the following, given it's fundamental.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DateTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I become stuck is that the combination of CustomerId and TripId does not make a unique row. My justification for TripId as the RowKey is that every query will be a dataset based on CustomerId and TripId.
Just for context, the CustomerId is clearly unique, the TripId represents one journey in a vehicle and within that journey the JourneyStep represents a unit within that Trip which may be 10 steps or 1000.
The intention is to aggregate the data into further tables, with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge, so I need to think about query performance from the outset.
Updated:
As requested, the solution is for vehicle telematics, so think of yourself in your own car: a black box ships data to a server, which in turn passes it to Azure Tables. In relational DB terms, I would have a customer table and a trip table with a foreign key back to the customer table.
The TripId is auto-generated by the black box. TripId does not need to be stored by date/time from a query point of view; however, it may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for each customer, so filter by customer and then by trip, then iterate over each row (JourneyStep) to plot onto a map.
Per customer, I will score each trip and then retrieve trips for, let's say, the last month to aggregate a score. I do have a SQL Database to enrich data with client records etc., but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table, so if someone made 10 trips in one month, I would run the second query which would score each trip, then produce a score for all trips that month and store both answers so potentially a table of trip aggregates and a table of monthly aggregates.
The thing about the PartitionKey is that it represents a logical grouping; you cannot insert a batch of data spanning multiple partition keys in a single transaction, for example. Similarly, rows with the same partition key are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.
As such, it is important to look at your domain and figure out what aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key.
You will need to, separately, have a table that lists all the Trip IDs that belong to a given Customer - which sort of makes sense, as you probably want to store some data, such as "trip name" etc., in such a table anyway.
Your design has to be related to your queries. You can filter your data based on two columns: PartitionKey and RowKey. PartitionKey is your most important column, since your queries will hit that column first.
In your case CustomerId should be your PartitionKey, since most of the time you will try to reach your data based on the customer. (You may also need to keep another table for your client list.)
Now, the RowKey can be your TripId or a time. If I were you, I would probably use a RowKey in the yyyyMMddHHmm|tripId format, which will let you query based on starts-with and ends-with style comparisons.
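To illustrate that, a hedged sketch (the entity class, variable names and exact separator are assumptions) of building such a RowKey and then pulling, say, one month of trips for a customer via a prefix range:

// RowKey sorts chronologically within the customer's partition.
string rowKey = tripStartUtc.ToString("yyyyMMddHHmm") + "|" + tripId;

// All trips for a customer in a given month (range on the timestamp prefix of the RowKey).
var query = new TableQuery<TripEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, customerId.ToString()),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, monthStart.ToString("yyyyMMddHHmm")),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, monthEnd.ToString("yyyyMMddHHmm")))));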
Adding to @Frans' answer:
One thing you could do is create a separate table for each customer, so that each customer's data is nicely segregated into its own table. Then you could use TripId as the PartitionKey and JourneyStep as the RowKey, as suggested by @Frans. For storing some metadata about the trip, instead of going into a separate table, I would still use the same table, but keep the RowKey empty and put the other information about the trip there.
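A rough sketch of that layout (the table-naming convention and the use of DynamicTableEntity are assumptions of mine):

// One table per customer; table names must be alphanumeric, so the GUID is used without dashes.
CloudTableClient client = storageAccount.CreateCloudTableClient();
CloudTable table = client.GetTableReference("Customer" + customerId.ToString("N"));
table.CreateIfNotExists();

// Journey steps: PartitionKey = TripId, RowKey = JourneyStep.
table.Execute(TableOperation.Insert(new DynamicTableEntity(tripId.ToString(), journeyStepId.ToString())));

// Trip metadata: same PartitionKey with an empty RowKey, as suggested above.
table.Execute(TableOperation.Insert(new DynamicTableEntity(tripId.ToString(), string.Empty)));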
I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: combination of (DateTime.MaxValue.Ticks - Time.Ticks), formatted as a zero-padded 19-digit string so it sorts correctly, together with the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("D19"), JourneyStep.ToString())
Such a combination will allow you to do the following queries "quickly".
Get data by CustomerId only. Example: context.Trips.Where(n => string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id + "_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz", n.PartitionKey) >= 0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n => n.PartitionKey == string.Format("{0}_{1}", customerId, tripId)).AsTableServiceQuery(context);
Get last X amount of journey steps if you were to search by either CustomerId or CustomerId/TripId by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee uniqueness of Times of Steps within each Trip, you don't even have to put JourneyStep into the RowKey as it is somewhat inconvenient
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
HTH
The design of your table storage keys is a function of optimizing the two major capabilities of Azure Tables:
Scalability
Search performance
As @Frans already pointed out, Azure Tables uses the PartitionKey to decide how to scale out your data onto multiple storage server nodes. Because of this, I would advise against having unique PartitionKeys, since in theory you will have Azure spreading your data across storage nodes that will each be able to serve one customer only. I say "in theory" because, in practice, Azure uses smart algorithms to identify whether there are patterns in your PartitionKeys and is thus able to group them (for example, if your IDs are consecutive numbers). You don't want to fall into this scenario, because the scalability of your storage will be unpredictable and at the hands of obscure algorithms making those decisions. See HERE for more information about scalability.
Regarding performance, the fastest way to search is to hit both PartitionKey and RowKey in your search queries. Contrary to Amazon DynamoDB, Azure Tables does not support secondary column indexes. If your search queries filter on attributes stored in columns other than those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the rowkey is a customer unique ID. This means our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, or by location, or by email address SMTP domain?
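With that key design the common case becomes a cheap point lookup; a minimal sketch (the entity class and variable names are assumptions):

// ~80% case: direct lookup by country (PartitionKey) + customer ID (RowKey), no table scan needed.
TableOperation retrieve = TableOperation.Retrieve<CustomerEntity>(countryCode, customerId);
TableResult result = table.Execute(retrieve);
CustomerEntity customer = (CustomerEntity)result.Result;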
I have a need to query a store of 200 million entities in Windows Azure. Ideally, I would like to use the Table Service, rather than SQL Azure, for this task.
The use case is this: a POST containing a new entity will be incoming from a web-facing API. We must query about 200 million entities to determine whether or not we may accept the new entity.
With the entity limit of 1,000: does this apply to this type of query, i.e. I have to query 1,000 at a time and perform my comparisons / business rules, or can I query all 200 million entities in one shot? I think I would hit a timeout in the latter case.
Ideas?
Expanding on Shiraz's comment about Table storage: Tables are organized into partitions, and then your entities are indexed by a Row key. So, each row can be found extremely fast using the combination of partition key + row key. The trick is to choose the best possible partition key and row key for your particular application.
For your example above, where you're searching by telephone number, you can make TelephoneNumber the partition key. You could very easily find all rows related to that telephone number (though, not knowing your application, I don't know just how many rows you'd be expecting). To refine things further, you'd want to define a row key that you can index into, within the partition key. This would give you a very fast response to let you know whether a record exists.
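For example, a hedged sketch of that lookup with the storage client (the RowKey scheme, and therefore whether an exact Retrieve is possible instead, depends on your schema):

// All rows for one telephone number: a single-partition query, so it stays fast
// regardless of how many hundreds of millions of entities the table holds overall.
var byNumber = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, telephoneNumber));
bool exists = table.ExecuteQuery(byNumber).Any();   // requires System.Linq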
Table storage (actually Azure Storage in general - tables, blobs, queues) has a well-known SLA. You can execute up to 500 transactions per second on a given partition. With the example above, the query for rows for a given telephone number would equate to one transaction (unless you exceed 1000 rows returned - to see all rows, you'd need additional fetches); adding a row key to narrow the search would, indeed, yield a single transaction. So would inserting a new row. You can also batch up multiple row inserts within a single partition and save them in a single transaction.
For a nice overview of Azure Table Storage, with some good labs, check out the Platform Training Kit.
For more info about transactions within tables, see this msdn blog post.
The limit of 1000 is the number of rows returned from a query, not the number of rows queried.
Pulling all of the 200 million rows into the web server to check them will not work.
The trick is to store the rows with a key that can be used to check if the record should be accepted.