Azure Table Storage data modeling considerations

I have a list of users. A user can log in using either a username or an e-mail address.
As a beginner with Azure Table Storage, this is the data model I use to get fast index scans:
PartitionKey      RowKey             Property
users:email       jacky#email.com    nickname:jack123
users:username    jack123            email:jacky#email.com
So when a user logs in via e-mail, I supply PartitionKey eq users:email in the Azure table query. If it is a username, I supply PartitionKey eq users:username.
Since it doesn't seem possible to simulate contains or like in an Azure table query, I'm wondering whether it is normal practice to store multiple rows of data for one user?

This is a perfectly valid practice and in fact a recommended one. Essentially, you have to identify the attributes on which you might query your table and use them, in some combination, as the PartitionKey and RowKey.
See Guidelines for table design for more information. From that link:
Consider storing duplicate copies of entities. Table storage is cheap so consider storing the same entity multiple times (with different keys) to enable more efficient queries.
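A minimal sketch of the "one row per login identifier" idea in Python, using plain dictionaries rather than any particular SDK. The function names, the lowercasing of keys, the "@" check, and the sample address are assumptions made for illustration; they are not part of the question.

```python
def login_entities(username: str, email: str) -> list[dict]:
    """Build two copies of the same user, one keyed by e-mail and one by username."""
    # Assumption: keys are normalised to lowercase so lookups are case-insensitive.
    email_key = email.strip().lower()
    username_key = username.strip().lower()
    return [
        {"PartitionKey": "users:email", "RowKey": email_key, "nickname": username},
        {"PartitionKey": "users:username", "RowKey": username_key, "email": email},
    ]


def lookup_keys(login: str) -> tuple[str, str]:
    """Choose the partition to query from the shape of what the user typed."""
    # Assumption: anything containing "@" is treated as an e-mail address.
    kind = "email" if "@" in login else "username"
    return (f"users:{kind}", login.strip().lower())


# Hypothetical sample user; both rows describe the same person.
entities = login_entities("jack123", "jack@example.com")
```

Either login form then resolves to an exact PartitionKey and RowKey pair, so the lookup never needs contains or like.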

Related

Implement Paging on Azure Cosmos Db Data coming from two separate documents

We have two separate sets of documents in Cosmos Db: one stores users and their various roles, and the second stores the permissions for a particular job.
The job list is unbounded and can grow substantially over time. As group by is not allowed across multiple documents, we are trying to figure out the best strategy for retrieving all users based on either a role or a particular job.
1) Solution 1 - Keep user data and job data as sub-documents in one large document. This helps with querying and even with continuation tokens.
2) Solution 2 - Keep user and role data in one document and jobs in multiple job documents, query them separately, and combine the results on the client side. In this case continuation token support is lost, as you have to query the complete data first to provide any meaningful results.
3) Solution 3 - Keep the role data with each job document and query on it directly. In this case we get the users for a job and then make a single query per user to get their information.
Can anyone recommend a better solution, or pick one of the three above and suggest a path forward?
It seems that you need extra storage for the relationship. You could use Azure SQL to store the relationships between users (documentId, userId, roleId), roles and jobs, and then store the variable property information, such as user info, in DocumentDB.

Audit Trail Design using Table Storage

I'm considering implementing an audit trail for my application using Table Storage.
I need to be able to log all actions for a specific customer and all actions for entities belonging to that customer.
My first guess was to create a table for each customer (Audits_CustomerXXX), using the entity id as the partition key and the (DateTime.MaxValue.Ticks - DateTime.Now.Ticks).ToString("D19") value as the row key. This works great when my question is "what happened to a certain entity?" For instance, the audit of a purchase would have PartitionKey = "Purchases/12345" and the timestamp as the RowKey.
But when I want a bird's-eye view of the entire customer, can I just query the table, sorting by row key across partitions? Or is it better to create a secondary table to hold the data with different partition keys? Also, when using (DateTime.MaxValue.Ticks - DateTime.Now.Ticks).ToString("D19"), is there a way to prevent errors when two actions in the same partition happen in the same tick (unlikely, but who knows...)?
Thanks
You could certainly create a separate table for the bird's-eye view, but you really don't have to. Since Azure Tables are schema-less, you can keep this data in the same table. For these entries you would use the reverse ticks as the PartitionKey and the entity id as the RowKey. Because you would be querying only on PartitionKey, you could also use a GUID as the RowKey, which ensures that all entities are unique. Or you could append a GUID to your entity id and use that as the RowKey.
However, keep in mind that because you're inserting two entities with different PartitionKey values, you will have to safeguard your code against possible network failures, as each entry will be a separate request to the Table service. The way we handle this in our application is to write the payload to a queue message and then process that message in a background process.
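The reverse-ticks key with a GUID suffix can be sketched in Python as follows. This is an illustration only: `to_ticks`, `reverse_ticks`, and `audit_entities` are hypothetical helper names, the underscore separator is an assumption, and the tick arithmetic mirrors .NET ticks (100 ns units since 0001-01-01) on a naive UTC datetime.

```python
import uuid
from datetime import datetime

MAX_TICKS = 3155378975999999999  # value of .NET DateTime.MaxValue.Ticks
_EPOCH = datetime(1, 1, 1)


def to_ticks(moment: datetime) -> int:
    """Convert a naive datetime to .NET-style ticks using integer arithmetic."""
    delta = moment - _EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


def reverse_ticks(moment: datetime) -> str:
    """Zero-padded to 19 digits so that newer moments sort first as strings."""
    return str(MAX_TICKS - to_ticks(moment)).zfill(19)


def audit_entities(entity_id: str, moment: datetime) -> tuple[dict, dict]:
    """Build the per-entity row and the bird's-eye row for one audited action."""
    stamp = reverse_ticks(moment)
    unique = uuid.uuid4().hex  # avoids key collisions for actions in the same tick
    per_entity = {"PartitionKey": entity_id, "RowKey": f"{stamp}_{unique}"}
    overview = {"PartitionKey": stamp, "RowKey": f"{entity_id}_{unique}"}
    return per_entity, overview
```

Because the GUID is appended after the timestamp, rows in the per-entity partition still sort newest first, and two actions in the same tick simply get different suffixes.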

partitionkey and rowkey in azure table storage

I understand the benefit of a partition key in azure table storage. However, given my relational database background, I am a bit confused about how to retrieve an entity from azure table storage given just the rowkey. As far as I know, this is impossible. This means that I have to store the partition key/rowkey pair somewhere to just get the entity given the rowkey. Should I just introduce a 'sharding' table with one arbitrary partition key, which allows me to look up the partition key given the rowkey?
It is possible but will result in a table scan as described in this section of MSDN.
If you don't need multiple partitions then it is absolutely fine to use a single partition (e.g. using a constant partition key), provided your data isn't going to be enormous in size and doesn't need the scalability of multiple partitions.
Another possible approach is to use your current RowKey as PartitionKey which would give you a highly scalable solution but would result in bad performance if you need to query ranges of rows.
The linked MSDN page talks about the pros and cons of both so I think with your knowledge about your specific problem domain you should be able to find a balanced solution.
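To make the difference concrete, here is a small Python sketch that builds the two kinds of filter expression. The helper names are made up for this example; the only thing taken from the Table service is the OData-style `eq` / `and` filter syntax, with single quotes escaped by doubling them.

```python
def _quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def point_query_filter(partition_key: str, row_key: str) -> str:
    """Both keys are known: the service can go straight to one entity."""
    return f"PartitionKey eq {_quote(partition_key)} and RowKey eq {_quote(row_key)}"


def row_key_only_filter(row_key: str) -> str:
    """Only the RowKey is known: valid, but every partition has to be scanned."""
    return f"RowKey eq {_quote(row_key)}"
```

Both strings are legal filters; the second one is the table scan mentioned above.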

Regarding Azure table design

I am a freelancer working on one of my own games, and I am trying to use the Azure Table service to log my users' moves.
The game is based on cards.
The flow is like this:
Many users (UserId) play on a table (TableId). Each game on the table has a unique GameId. In each game there can be multiple deals, each with a unique DealId.
There can be multiple deals on the same table with the same GameId. Also, each user will have the same DealId in a single game.
The winner is decided after a player has had multiple chances.
Problem:
I can make TableId the PartitionKey, but I am not sure what to choose for the RowKey, because the combination of TableId and RowKey (GameId/UserId/DealId) should be unique in the table.
I can have entries like:
TableId  GameId  DealId  UserId  timestamp
1        201     300     12345
1        201     300     12567
Maybe I could create four Azure tables like the ones below, but then I am doing a lot of duplication; also, I would not be able to fire a point query as mentioned at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
TableId, GameId, DealId and UserId are all of type long.
I would like to query the data as follows:
1) Get all the logs for a TableId.
2) Get all the logs for a TableId in a particular game (GameId).
3) Get all the logs of a user (UserId) in a game (GameId).
4) Get all the logs of a user in a deal (DealId).
5) Get all the logs for a table on a given date; similarly for a user, game and deal.
Based on what I know about Azure Tables so far, I believe you're on the right track.
However, there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need separate tables for each kind of data, though that approach does separate the data nicely. If you want, you could store everything in a single table. If you go with a single table, since these ids (Game, Table, User, and Deal) are numbers, I would recommend prefixing each value so that you can easily identify it. For example, when a PartitionKey denotes a Game Id, you can prefix the value with G| so that you know it's a Game Id, e.g. G|101.
Pre-pad your Id values with 0 to make them equal-length strings
You mentioned that your id values are of type long. However, the PartitionKey value is a string. I would recommend pre-padding the values with zeros so that they are all of equal length. For example, when storing Game Id as PartitionKey, instead of storing 1, 2, 103 etc., store 00000000001, 00000000002, 00000000103. This way, when you list all Ids, they will be sorted in numeric order. Without pre-padding, you will get the results as 1, 10, 11, 12....19, 2, 20.
You will lose transaction support
Since you're using multiple tables (or even a single table with different PartitionKeys), you will not be able to use the Entity Batch Transactions available in Azure Tables, and each insert has to be done as a separate operation. Since each operation is a network call and can fail, you may want to do the inserts through an idempotent background process that keeps retrying until the data has been written to all the tables.
Instead of a GUID for RowKey, I suggest you create a composite RowKey based on other values
This matters most for the update scenario. Since an update requires both PartitionKey and RowKey, I would recommend a RowKey composed of other values. For example, if you're using TableId as PartitionKey for GameLogsByTableId, I would suggest creating a RowKey from the other values, e.g. U|[UserId]|D|[DealId]|G|[GameId]. This way, when you need to update a record, you already know how to build its RowKey instead of having to fetch the data from the table first.
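The prefixing, zero-padding and composite RowKey suggestions can be sketched in Python like this. The function names and the 19-digit width (enough for any non-negative 64-bit id) are assumptions for this example; the G| prefix and the U|...|D|...|G|... layout follow the text above.

```python
ID_WIDTH = 19  # assumption: wide enough for any non-negative 64-bit integer id


def pad_id(value: int) -> str:
    """Left-pad an id with zeros so string order matches numeric order."""
    if value < 0:
        raise ValueError("ids are expected to be non-negative")
    return str(value).zfill(ID_WIDTH)


def partition_key(prefix: str, value: int) -> str:
    """Prefix the padded id, e.g. 'G' for a Game Id or 'T' for a Table Id."""
    return f"{prefix}|{pad_id(value)}"


def row_key_for_table_partition(user_id: int, deal_id: int, game_id: int) -> str:
    """Composite RowKey for rows stored under a TableId partition."""
    return f"U|{pad_id(user_id)}|D|{pad_id(deal_id)}|G|{pad_id(game_id)}"


# Unpadded ids sort as text, padded ones sort numerically.
unpadded = sorted(str(n) for n in (1, 2, 10))
padded = sorted(partition_key("G", n) for n in (10, 2, 1))
```

Since the RowKey is derived entirely from values the caller already has, an update can address the entity directly without reading it first.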
Partition Scans
I looked at your querying requirements and almost all of them would result in partition scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements. In these cases, you would need to scan the entire partition for a user to find information about a Game Id or Deal Id. So be prepared for the scenario where the table service returns nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use table storage for this. It will make your job much harder than using an SQL database, where you can use any index you like, have relational integrity, and much more. The only thing in favour of ATS is that it's cheap for large data.

What is the disadvantage to unique partition keys?

My data set will only ever be directly queried (meaning I am looking up a specific item by some identifier) or will be queried in full (meaning return every item in the table). Given that, is there any reason to not use a unique partition key?
From what I have read (e.g.: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#choosing-an-appropriate-partitionkey) the advantage of a non-unique partition key is being able to do transactional updates. I don't need transactional updates in this data set so is there any reason to partition by anything other than some unique thing (e.g., GUID)?
Assuming I go with a unique partition key per item, this means that each partition will have one row in it. Should I repeat the partition key in the row key or should I just have an empty string for a row key? Is a null row key allowed?
Zhaoxing's answer is essentially correct but I want to expand on it so you can understand a bit more why.
A table partition is defined as the table name plus the partition key. A single server can have many partitions, but a partition can only ever be on one server.
This fundamental design means that access to entities stored in a single partition cannot be load-balanced because partitions support atomic batch transactions. For this reason, the scalability target for an individual table partition is lower than for the table service as a whole. Spreading entities across many partitions allows Azure storage to scale your load much better.
Point queries are optimal which is great because it sounds like that's what you will be doing a lot of. If partition key has no logical meaning (ie, you won't want all the entities in a particular partition) you're best splitting out to many partition keys. Listing all entities in a table will always be slower because it's a scan. Azure storage will return continuation tokens if we hit timeout, 1000 entities, or a server boundary (as discussed above). Many of the storage client libraries have convenience methods which should help you handle this by automatically following these tokens as you iterate through the list.
TL;DR: With the information you've given I'd recommend a unique partition key per item. Null row keys are not allowed, but however else you'd like to construct the row key is fine.
Reading:
Azure Storage Table Design Guide
Azure Storage Performance Check List
If you don't need EntityGroupTransaction to update entities in a batch, unique partition keys are a good option for you.
However, I think the table service's auto-scale feature may not work perfectly in that case. When some of the data in a partition is 'hot', the table service will move it to another cluster to improve performance. But since you have unique partition keys, probably none of your entities will be identified as 'hot', whereas if you grouped them into partitions, some partitions would become 'hot' and be moved. The problem described below may also occur if you use a static partition key.
Besides, the table service may return only part of the entities matching your query when:
More than 1000 entities are in the result.
A partition boundary is crossed.
From your question, you also need a full query (return all entities). If you are using unique partition keys, each entity is its own partition, so your query will return only 1 entity along with a continuation token, and you will need to fire another query with this continuation token to retrieve the next entity. I don't think this is what you want.
So my suggestion is to select a reasonable partition key in any case, even if it looks useless for your business, because it helps the table service optimize your data.

Resources