Location based data in azure table storage - azure

I am designing an app which has a feature that allows a user to store geolocation based data, then later on allow other users to query for those data that falls within a given radius of their current geolocation.
The question is what is the best approach to design the table to be scalable and has great performance? I was thinking of having a table containing latitude as the partition key (pk), longitude as the row key (rk), then an dataid column that maps to another table that uses that dataid as its partition key.
My thinking is that using the 2-way lookup it's going to boost up my performance since both lookup would be instant. Is this the right way of thinking? I was reading somewhere that looking for partition keys that falls into some range is bad. If that is the case how should I approach this? How does Google maps, Apple map kit typically implement those feature?

This is difficult to answer completely without knowing your access patterns and scenarios fully. You can use the recently published Azure Storage Table design guide to help come up with a good design for your problem
http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
For reads, Azure Table Storage is designed for fast point queries where client knows the partition key and row key so you need to factor that in your data model. For writes, you want uniform distribution and avoid append/prepend pattern to achieve high scale.

Related

How do I find out right data design and right tools/database/query for below requirement

I have a kind of requirement but not able to figure out how can I solve it. I have datasets in below format
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use-case is to perform comparison, aggregation and queries over multiple row like
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123&GradeA
Time difference between first, 3rd, 5th and latest one
all data (or last 10 records for particular id) should be easily accessible.
Also need to further do compute. What format should I chose for dataset
and what database/tools should I use?
I don't Relational Database is useful here. I am not able to solve it with Solr/Elastic if you have any ideas, please give a brief.Or any other tool Spark, hadoop, cassandra any heads?
I am trying out things but any help is appreciated.
Choosing the right technology is highly dependent on things related to your SLA. things like how much can your query have latency? what are your query types? is your data categorized as big data or not? Is data updateable? Do we expect late events? Do we need historical data in the future or we can use techniques like rollup? and things like that. To clarify my answer, probably by using window functions you can solve your problems. For example, you can store your data on any of the tools you mentioned and by using the Presto SQL engine you can query and get your desired result. But not all of them are optimal. Furthermore, usually, these kinds of problems can not be solved with a single tool. A set of tools can cover all requirements.
tl;dr. In the below text we don't find a solution. It introduces a way to think about data modeling and choosing tools.
Let me take try to model the problem to choose a single tool. I assume your data is not updatable, you need a low latency response time, we don't expect any late event and we face a large volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you wanna query on a particular ID), so solutions like parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. Both the first and second requirements and the last requirement, count on ID as an identifier part and it seems there is nothing like join and global ordering based on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the cluster part; For each ID, events are ordered based on the time.
The third requirement is a bit vague. You wanna result on all data? or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the mentioned notes, it seems we should choose a tool that has good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are things that currently I can remember them. Let's check them:
Cassandra: It's great on response time on random access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you should carefully design your data model and it seems it's not a good tool that we can choose (because of future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now, we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries and indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we can have data locality on computations). But there is a problem: ID is not unique; so we can not choose ID as the primary key and we face some problems with random access (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us).
Druid: It's a great OLAP tool. Based on the storing manner (segment files) that Druid follows, by choosing the right data model, you can have analytic queries on a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST), we can answer our questions. But by using rollup, we lose raw data and we need them.
MongoDB: It supports random access queries and sharding. Also, we can have some type of window function on its computing framework and we can define some sort of pipelines for doing aggregations. It supports capped collections and we can use it to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kind of filter aggregations, we can have a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine we can answer the first and second questions with ES, but for now, I can't make a query in my mind. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can answer our requirements, but there is a lot of 'if's on the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

Are dummy partition keys always bad?

I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether. By dummy, I mean a column whose only purpose is to contain the same value for all rows, thereby putting all data on 1 node and giving the lowest possible cardinality. For example:
dummy | id | name
-------------------------
0 | 01 | 'Oliver'
0 | 02 | 'James'
0 | 03 | 'Nicholls'
The two main points in regards to why you should avoid dummy partition keys are:
1) You end up with data "hot-spots". There is a lot of data stored on 1 node so there's more traffic around that node and you have poor distribution around the cluster.
2) Partition space is finite. If you put all data on one partition, it will eventually be incapable of storing any more data.
I can understand these points and I agree that you definitely want to avoid those situations, so I put this idea out of my mind and tried to think of a good partition key for my table. The table in question stores sites and there are two common ways that table gets queried in our system. Either a single site is requested or all sites are requested.
This puts me in a bit of an awkward situation, because the table is either queried on nothing or the site ID, and making a unique field the partition key would give me very high cardinality and high latency on queries that request all sites.
So I decided that I'd just choose an arbitrary field that would give relatively low cardinality, even though it doesn't reflect how the data will actually be queried, just because it's better than having a cardinality that is either excessively high or excessively low. This approach also has problems though.
I could partition my data on column x, but we have numerous clients, all of whom use our system differently, so x for 1 client could give the results I'm after, but could give awful results for another.
At this point I'm running out of options. I need a field in my table that will be consistent for all clients, however this field doesn't exist, so I'm now considering having a new field that will contain a random number from 1-3 and then partitioning on that field, which is essentially just a dummy field. The only difference is that I want to randomise the values a little bit as to avoid hot-spots and unbounded row growth.
I know this is a data-modelling question and it varies from system to system, and of course there are going to be situations where you have to choose the lesser of two evils (there is no perfect solution), but what I'm really focussed on with this question is:
Are dummy partition keys something that should outright never be a consideration in Cassandra, or are there situations in which they're seen as acceptable? If you think the former, then how would you approach this situation?
I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether.
I'm going to go out on a limb and guess that your search has yielded my article We Shall Have Order!, where I made my position on the use of "dummy" partition keys quite clear. Bearing that in mind, I'll try to provide some alternate solutions.
I see two potential problems to solve here. The first:
I need a field in my table that will be consistent for all clients, however this field doesn't exist
Typically this is solved by duplicating your data into another query table. That's the best way to serve multiple, varying query patterns. If you have one client (service?) that needs to query that table by site id, then you could have that table duplicated into a table called sites_by_id.
CREATE TABLE sites_by_id (
id BIGINT,
name TEXT,
PRIMARY KEY (id));
The other problem is this query pattern:
all sites are requested
Another common Cassandra anti-pattern is that of unbound SELECTs (SELECT query without a WHERE clause). I am sure you understand why these are bad, as they require all nodes/partitions to be read for completion (which is probably why you are looking into a "dummy" key). But as the table supporting these types of queries increases in size, they will only get slower and slower over time...regardless of whether you execute an unbound SELECT or use a "dummy" key.
The solution here is to perform a re-examination of your data model, and business requirements. Perhaps your data can be split up into sites by region or country? Maybe your client really only needs the sites that have been updated for this year? Obtaining some more details on the client's query requirements may help you find a good partitioning key for them to use. Otherwise, if they really do need all of them all of the time, then doanduyhai's suggestion of using Spark will better fit your use case.
or all sites are requested
So basically you have a full table scan scenario. Isn't Apache Spark over Cassandra a better fit for this use-case ? I suspect it's an analytics use-case, isn't it ?
As far as I understand, you want to access a single site by its id, in which case lookup by partition key is ideal. The other use-case which requires to fetch all the sites is best suited with Spark

Cassandra data modelling less then 1000 records to fit in one row

We have some entity uniquely identified by generated UUID. We need to support find by name query. Also we need to support sorting to be by name.
We know that there will be no more than 1000 of entities of that type which can perfectly fit in one row. Is it viable idea to hardcode primary key, use name as clustering key and id as clustering key there to satisfy uniqueness. Lets say we need school entity. Here is example:
CREATE TABLE school (
constant text,
name text,
id uuid,
description text,
location text,
PRIMARY KEY ((constant), name, id)
);
Initial state would be give me all schools and then filtering by exact name will happen. Our reasoning behind this was to place all schools in single row for fast access, have name as clustering column for filtering and have id as clustering column to guaranty uniqueness. We can use constant = school as known hardcoded value to access this row.
What I like about this solution is that all values are in one row and we get fast reads. Also we can solve sorting easy by clustering column. What I do not like is hardcoded value for constant which seams odd. We could use name as PK but then we would have 1000 records spread across couple of partitions, probably find all without name would be slower and would not be sorted.
Question 1
Is this viable solution and are there any problems with it which we do not see? I did not see any example on Cassandra data modelling with hardcoded primary key probably for the reason so we are doubting this solution.
Question 2
Name is editable field, it will probably be changed rarely (someone can make typo or school can change name) but it can change. What is best way to achieve this? Delete insert inside batch (LTE can be applied to same row with conditional clause)?
Yes this is a good approach for such a small dataset. Just because Cassandra can partition large datasets across multiple nodes does not mean that you need to use that ability for every table. By using a constant for the partition key, you are telling Cassandra that you want the data to be stored on one node where you can access it quickly and in sorted order. Relational databases act on data in a single node all the time, so this is really not such an unusual thing to do.
For safety you will probably want to use a replication factor higher than one so that there are at least two copies of the single partition. In that way you will not lose access to the data if the one node where it is stored went down.
This approach could cause problems if you expect to have a lot of clients (i.e. thousands of clients) frequently reading and writing to this table, since it could become a hot spot. With only 1000 records you can probably keep all the rows cached in memory by setting the table to cache all keys and rows.
You probably won't find a lot of examples where this is done because people move to Cassandra for the support of large datasets where they want the scalability that comes from using multiple partitions. So examples are geared towards that.
Is this viable solution and are there any problems with it which we do not see? I did not see any example on Cassandra data modelling with hardcoded primary key probably for the reason so we are doubting this solution.
I briefly addressed this type of modeling solution earlier this year in my article: We Shall Have Order! This is what is known as a "dummy key," where each row has the same partition key. This is a shortcut that allows you to easily order all of your rows (on an unbound SELECT *) by clustering column(s).
Problems with this solution:
Cassandra allows a maximum of 2 billion column values per partition key. When using a dummy partition key, you will approach this limit with each value that you add.
Your data will all be stored in the same partition, which will create a "hot spot" (large groupings of data) in your cluster. This means that your data model will immediately void one of Cassandra's main benefits...data distribution. This will also complicate load balancing (the same nodes and ranges will keep serving all of your requests).
I can see that your model is designed around a SELECT * query. Cassandra works best when you can give it specific keys to query by. Unbound SELECT * queries (queries without WHERE clauses) are not a good idea to be doing with Cassandra, as they can lead to timeouts (as your data grows).
From reading through your question, I know that you're going to say that you're only using it for 1000 rows. That your dataset won't ever grow much beyond those 1000 rows, so you won't hit any of the roadblocks that I have mentioned.
So then I have to wonder, why are you using Cassandra? As a Cassandra MVP, that's a question I don't ask often. But you don't have an especially large data set (which is what Cassandra is designed to work with). Relying on that fact as a reason to use a product incorrectly is not really the best solution.
Honestly, I am going to recommend that you save yourself some complexity, and use a RDBMS instead. That will fit your use case significantly better than Cassandra will. Then you can update and order by whatever fields you wish.

is the notion of a Table required at all in Azure Table service?

I am working with azure table services. What I would like to ask is whether the Azure internals care about the notion of Table at all?
Making things fast largely depend on partition key and row key. Tables do not look like containers or Grouping of entities because there is no limit on number of tables you can create. The total storage size is tied to my storage account.
So is Table just a notion to help people transition from RDBMS land or do they serve a purpose internally? Can I do applications with a single table design without worrying about performance? After all, if table is a just a tag then I may as well include as part of partition key.
EDIT
To give an example, Table partition keys look very much like Cassandra rows and Table rows are like Cassandra columns. It is okay to treat the Storage as a big bucket of key(RowKey)-value pairs. Partition keys are sharding mechanism. Then table just comes across as a "labeling" notion.
You can have all your entities in a single azure table storage and get optimum performance by thoughtfully choosing a right Partition Key and Row Key combination, but IMHO it is any time better to have related entities each in a separate table as that would be easy manageable ( from a developer perspective) . you know what part of you application is hitting which table.
You may want to view this recorded session which discusses best practices and internals.
Windows Azure Storage: What’s Coming, Best Practices, and Internals
You can think the union of a Table name and a Partition Key as being a unit of performance.
Youre probably on the thought mindset that will re-invent FatEntities. developed by Lokad.
The concept of a "Table" has many restrictions and issues. For example if you have a large table with 100,000 entries in a partition, then you can't easily jump to entry 99,001 without iterating through each of the entries. Going backwards is impossible (you cant start at the last entry and go backwards)

About Azure Table Secondary Index

I know the Secondary Index(s) is not here yet: It's in wish list and "planed"
I like to get some ideas (or information from the reliable source) about the incoming secondary index(s)
1st question: I noticed MS planed "secondary indexes": is that mean we can create as many indexes as we want on one table
2nd question: Current index is "PartitionKey+RowKey", if above question is not true, will the secondary index be "RowKey+PartitionKey" or we have a good chance that we can customize it?
I like to gain some ideas because I am currently design a table, since the data won't much from beginning, so I think I can wait for the secondary index feature without create multiple tables at this moment.
Please share you ideas or any source you have, thanks.
There's currently no information on secondary indexes, other than what's written at the site you referenced. So, there's no way to answer either of your two questions.
Several customers I work with, that use Table Storage, have taken the multiple-table approach to provide additional indexing. For those requiring extensive index coverage, that data typically has found its way into SQL Azure (or a combination of SQL Azure + Table Storage).
As a Windows Azure MVP I don't have any information about the secondary indexes in table service. If we do need more indexes in table service, but don't want use SQL Azure, (Not just because of the pricing...) then I would like to de-normalization my data, which split the same data into more than one table, with different row key as the indexes.
This question is now two years old. And still no sign of secondary indexes in Azure Table Storage. My guess is that it is now very unlikely to ever eventuate.
Azure Cosmos DB provides the Table API for applications that are written for Azure Table storage and that need capabilities like:
Automatic Secondary Indexing
From: Introduction to Azure Cosmos DB: Table API
So if you are willing to move your tables over to Cosmos, then you will get all the indexing you could ever want.

Resources