I'm still trying to get my head around the correct way to use Azure Tables. I understand that they have a partition key and a row key, and that's it. Everything else is just data that you keep in that row.
Use Case
My web app gets files uploaded by a user, puts them in a queue, then has a worker role process the queue and do analytics on those files.
I would like to put messages about those files in an Azure Table based on what we find when we process those files.
I then plan on making an AJAX call to get a member's messages when they visit a webpage. If the user clicks on the message or closes the message, then I'll delete it from the table. Very StackOverflowish.
Question
My question is on how to best store these messages in my Azure Table.
Here's my thinking so far:
PartitionKey: MemberID
RowKey: ???(not sure what to have)
Column Data: Message data including any links and a time stamp. Probably a view count too.
I can't think of what I would put in a separate index for the row key. Timestamp could work so I can order messages correctly, but I don't think I'll get much bang for my buck with that.
I have found that the best way to think about the choice of partition and row keys is to think about the data access patterns. Usually your access pattern is to have a single row/entity represent something meaningful in your system. In your case it sounds like userid/fileid uniquely identifies the entity. From this, you have three options:
userid for partition key, fileid for row key
constant value for partition key, and a combination of userid and fileid for row key
constant value for row key, and a combination of userid and fileid for partition key
The deciding factor is what other access patterns you have. Are you going to be querying for all files for a particular user? Then you would want userid as the partition key or row key. If you will only ever be querying based on fileid/userid, then it doesn't really matter.
Erick
Before thinking about actual storage, you should try to think about what entities you're going to have.
Sounds like something like this:
User entity
UserFile entity
FileMessage entity
Do you have one FileMessage per UserFile or can you have more than one? It sounds like (by your explanation of the deletion logic) you would only have one FileMessage per File.
If my assumptions so far are correct and if it were me, the FileMessage table would have the following structure:
PartitionKey: userId
RowKey: fileId (name/url/etc)
Other columns: as you see fit
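As a rough illustration of that structure, here is a minimal Python sketch that builds plain dictionaries shaped like table entities. The column names Message, Created and ViewCount are assumptions taken loosely from the question, and messages_for_user is only a hypothetical in-memory stand-in for a query restricted to one partition; no storage SDK is involved.

```python
from datetime import datetime, timezone


def make_file_message(user_id, file_id, text, created=None):
    """Build a dict shaped like a table entity for one message about one file."""
    created = created or datetime.now(timezone.utc)
    return {
        "PartitionKey": str(user_id),  # all of a member's messages share a partition
        "RowKey": str(file_id),        # assumes one message per file for a user
        "Message": text,               # assumed column name
        "Created": created.isoformat(),
        "ViewCount": 0,
    }


def messages_for_user(entities, user_id):
    """In-memory stand-in for a query on a single PartitionKey."""
    return [e for e in entities if e["PartitionKey"] == str(user_id)]


table = [
    make_file_message(7, "report.csv", "2 rows were skipped"),
    make_file_message(7, "photo.png", "image resized"),
    make_file_message(9, "notes.txt", "file was empty"),
]
mine = messages_for_user(table, 7)
```

Deleting a message after the user dismisses it then only needs the userId and fileId, because together they identify the row.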
HTH
I would think of it this way: the Partition Key is how you want to break the data out, so if data is related, you want to keep the partition key the same. If you are working with a lot of data, you may want to use something like the date for the Partition Key. The Row Key is the index, so that is what you will use to query the data.
Related
I'm trying to move the Azure Storage log files into Azure Storage tables so I can work with them more easily, but I noticed this:
"duplicate log records may exist in logs generated for the same hour
and can be detected by checking for duplicate RequestId and Operation
number."
Source: https://blogs.msdn.microsoft.com/windowsazurestorage/2011/08/02/windows-azure-storage-logging-using-logs-to-track-storage-requests/
(I know it's an old article, but it's all I can find)
With this in mind, I thought it would be sensible to use a concatenation of the requestID with the operationID as my row key.
I wanted to check if anyone is aware just how unique the requestID is (apparently some requests, such as "copy", might have more than 1 operation, but most will have just 1).
If I'm using it as a row key, I can't afford for it to appear twice in the same partition (partitioning by userID, but let's suppose each partition can contain millions of records).
Thanks
"If I'm using it as a row key, I can't afford for it to appear twice in the same partition (partitioning by userID, but let's suppose each partition can contain millions of records)."
If I understand correctly, you could combine the requestID and a new GUID, joined by a separator, as a unique row key. For example: requestId|newGuid.
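A minimal sketch of both row key shapes discussed in this thread, in Python. The function names are hypothetical. Note the trade-off: a key built from the request ID and operation number is identical for a duplicate log record, so the duplicate collides with the first copy, whereas a key with a fresh GUID is always new, so duplicate records would be stored side by side.

```python
import uuid

SEPARATOR = "|"  # assumed delimiter, as in the requestId|newGuid example


def dedup_row_key(request_id, operation_number):
    """Same log record -> same key, so a duplicate maps onto the existing row."""
    return f"{request_id}{SEPARATOR}{operation_number}"


def unique_row_key(request_id):
    """Always a new key, even when the same log record is seen twice."""
    return f"{request_id}{SEPARATOR}{uuid.uuid4()}"


first = dedup_row_key("req-123", 1)
again = dedup_row_key("req-123", 1)
guid_a = unique_row_key("req-123")
guid_b = unique_row_key("req-123")
```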
I have to develop a project using a NoSQL database, either Couchbase or Cassandra.
I would like to know if it is recommended to partition the data of each customer in a bucket?
In my case, there will never be queries that span different customers.
The data can be completely separated.
For Couchbase, I saw that a memory quota is reserved for each bucket.
Or does the separation have to be done somewhere else, such as at the document level, or with a super column for Cassandra?
Thank you
"Or does the separation have to be done somewhere else, such as at the document level, or with a super column for Cassandra?"
Tip #1, when working with Cassandra, completely erase the word "super column" from your vocabulary.
"I would like to know if it is recommended to partition the data of each customer in a bucket?"
That depends. It sounds like your queries would be mostly based on a customer id, so it makes sense to have it as a part of your partition key. However, if each customer partition has millions of rows and/or columns underneath it, that's going to get very big.
Tip #2, proper Cassandra modeling is done based on what your required queries look like. So without actually seeing the kinds of queries you need to serve, it's going to be difficult to be any more specific than that.
If you have customer data relating to accounts and addresses, etc, then building a customers table with a PRIMARY KEY of only customer_id might make sense. But if you find that you need to query your customers (for example) by email_address, then you'll want to create a customers_by_email table, duplicate your data into that, and create a PRIMARY KEY that supports that.
Additionally, if you find yourself storing data on customer activity, you may want to consider a customer_activity table with a PRIMARY KEY of ((customer_id,month),activity_time). That will use both customer_id and month as a partition key, storing the customer's activity clustered by activity_time. In this case, if we didn't use month as an additional partition key, each customer_id partition would be continually written to, until it became too ungainly to write to or query (unbound row growth).
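To make the month bucket concrete, here is a small Python sketch of how an application might derive the (customer_id, month) partition for a write, and which buckets it would need to read for a time range. The "YYYY-MM" string format for month and both function names are assumptions for illustration; the table could just as well store month in another form.

```python
from datetime import datetime


def activity_partition(customer_id, activity_time):
    """Partition key values for one activity row: (customer_id, 'YYYY-MM')."""
    return (customer_id, activity_time.strftime("%Y-%m"))


def months_between(start, end):
    """List the 'YYYY-MM' buckets covering start..end, inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


key = activity_partition(42, datetime(2024, 3, 15, 10, 30))
to_read = months_between(datetime(2023, 11, 5), datetime(2024, 2, 1))
```

A range query then becomes one query per bucket in to_read, each of which stays inside a single partition.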
Summary:
If anyone tells you to use a super column in Cassandra, slap them.
You need to know your queries before you design your tables.
Yes, customer_id would be a good way to keep your data separate and ensure that each query is restricted to a single node.
Build your partition keys to account for unbound row growth, to save you from writing too much data into the same partition.
I am working as a freelancer, and right now I am working on one of my games and trying to use the Azure Table service to log my users' moves in Azure tables.
The game is based on Cards.
The flow is like this:
Many users (UserId) will be playing on a table (TableId). Each game on the table will have a unique GameId. In each game there could be multiple deals, each with a unique DealId.
There can be multiple deals on the same table with the same GameId. Also, each user will have the same DealId in a single game.
Winner is decided after multiple chances of a player.
Problem:
I can make TableId the PartitionKey, but I am not sure what to choose for the RowKey, because the combination of TableId and RowKey (GameId/UserId/DealId) should be unique in the table.
I can have entries like:
TableId  GameId  DealId  UserId  timestamp
1        201     300     12345
1        201     300     12567
Maybe what I can do is create 4 Azure tables like the ones below, but then I am doing a lot of duplication; also, I would not be able to fire a point query as described at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
Format of TableId,GameId,DealId and UserId is long.
I would like to query the data as follows:
1. Get me all the logs from a TableId.
2. Get me all the logs from a TableId and in a particular game (GameId).
3. Get me all the logs of a user (UserId) in this game (GameId).
4. Get me all the logs of a user in a deal (DealId).
5. Get me all the logs from a table on a date; similarly for a user, game and deal.
Based on my knowledge so far of Azure Tables, I believe you're on the right track.
However there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need to use separate tables for storing each kind of data though this approach logically separates the data nicely. If you want, you could possibly store them in a single table. If you go with single table, since these ids (Game, Table, User, and Deal) are numbers what I would recommend is to prefix the value appropriately so that you can nicely identify them. For example, when specifying PartitionKey denoting a Game Id, you can prefix the value with G| so that you know it's the Game Id e.g. G|101.
Pre-pad your Id values with 0 to make them equal length string
You mentioned that your id values are long. However, the PartitionKey value is of string type. I would recommend pre-padding the values so that they are of equal length. For example, when storing the Game Id as PartitionKey, instead of storing the values as 1, 2, 103 etc., store them as 00000000001, 00000000002, 00000000103. This way, when you list all Ids, they will be sorted in proper order. Without pre-padding, the keys are compared as strings and you will get the results as 1, 10, 11, 12, ..., 19, 2, 20.
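The effect of padding on sort order can be checked with a few lines of Python. This is only a sketch: the width of 19 is an assumption (it is enough digits for the largest signed 64-bit value), and pad_id is a hypothetical helper name.

```python
ID_WIDTH = 19  # assumed width: enough digits for the largest signed 64-bit value


def pad_id(value, width=ID_WIDTH):
    """Left-pad a non-negative integer id with zeros to a fixed width."""
    if value < 0:
        raise ValueError("ids are expected to be non-negative")
    return str(value).zfill(width)


ids = [1, 2, 10, 19, 20, 103]

# Keys are compared as strings, so unpadded ids sort lexically.
unpadded = sorted(str(i) for i in ids)

# Padded ids sort in the same order as the numbers themselves.
padded = sorted(pad_id(i) for i in ids)
```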
You will lose transaction support
Since you're using multiple tables (or even a single table with different PartitionKeys), you will not be able to use the Entity Batch Transactions available in Azure Tables, and each insert has to be done as its own separate operation. Since each operation is a network call and can fail, you may want to do the inserts through an idempotent background process that keeps retrying until the data has been written to all of the tables.
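A minimal Python sketch of such a retrying writer follows. Everything here is hypothetical: writers maps a table name to a callable that writes one entity, and each callable is assumed to behave like an insert-or-replace, so calling it again with the same keys is harmless. Catching every Exception is a simplification for the sketch.

```python
import time


def write_copies(entity, writers, max_attempts=5, delay_seconds=0.0):
    """Write one entity through every writer, retrying only the ones that failed.

    Returns the sorted names of writers that still failed after max_attempts.
    """
    pending = dict(writers)
    for attempt in range(max_attempts):
        for name in list(pending):
            try:
                pending[name](entity)
            except Exception:
                continue  # leave this writer pending and retry it next round
            del pending[name]
        if not pending:
            return []
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return sorted(pending)


# Usage with in-memory stand-ins for two of the tables.
store = {}
attempts = {"count": 0}


def flaky_writer(entity):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    store["GameLogsByTableId"] = entity


def steady_writer(entity):
    store["GameLogsByGameId"] = entity


failed = write_copies(
    {"UserId": 12345},
    {"GameLogsByTableId": flaky_writer, "GameLogsByGameId": steady_writer},
)
```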
Instead of Guid for RowKey, I suggest you create a composite RowKey based on other values
This is more applicable for update scenario. Since an update requires both PartitionKey and RowKey, I would recommend using a RowKey which is created as a composition of other values. For example, if you're using TableId as PartitionKey for GameLogsByTableId, I would suggest creating a RowKey using other values e.g. U|[UserId]|D|[DealId]|G|[GameId]. This way, when you get a record to update, you automatically know how to create a RowKey instead of fetching the data first from the table.
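Here is a short Python sketch of building such a composite RowKey, combined with the zero-padding suggested earlier. The function names, the "T|" prefix for the partition key and the default width are assumptions for illustration.

```python
def pad(value, width=19):
    """Zero-pad an integer id so that keys sort in numeric order."""
    return str(value).zfill(width)


def table_partition_key(table_id, width=19):
    """PartitionKey for GameLogsByTableId; the 'T|' prefix is an assumed convention."""
    return f"T|{pad(table_id, width)}"


def composite_row_key(user_id, deal_id, game_id, width=19):
    """RowKey of the form U|[UserId]|D|[DealId]|G|[GameId]."""
    return f"U|{pad(user_id, width)}|D|{pad(deal_id, width)}|G|{pad(game_id, width)}"


# Both keys can be rebuilt from values the caller already has,
# so an update does not need a lookup first.
pk = table_partition_key(1, width=6)
rk = composite_row_key(12345, 300, 201, width=6)
```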
Partition Scans
I looked at your querying requirements and almost all of them would result in Partition Scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements. In this case, you will need to scan the entire partition for a user to find information about a Game Id and Deal Id. So please be prepared for the scenario where table service returns you nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use table storage for this. It will make your job much harder than using a SQL database, where you can use any index you like, have relational integrity, and so much more. The only thing in favour of ATS is that it's cheap for large data.
I have a list of known partition key and row key pairs from the same table (e.g. P1R1, P2R2, P3R3, where P is the PartitionKey and R is the RowKey). Does anyone know how to query Azure Table Storage to get these 3 entities in one request?
I don't think there's an option besides just spelling it out in the Where/Filter clause explicitly.
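For reference, spelling it out means building one filter expression that ORs together a PartitionKey/RowKey match per pair. A minimal Python sketch of that string building follows; the helper names are hypothetical, and the resulting string would still have to be passed to whichever client library you use.

```python
def point_filter(partition_key, row_key):
    """Filter clause that matches exactly one entity."""
    def quote(value):
        # Single quotes inside a string literal are escaped by doubling them.
        return "'" + value.replace("'", "''") + "'"

    return f"(PartitionKey eq {quote(partition_key)} and RowKey eq {quote(row_key)})"


def multi_point_filter(pairs):
    """OR together one point filter per (PartitionKey, RowKey) pair."""
    return " or ".join(point_filter(pk, rk) for pk, rk in pairs)


query = multi_point_filter([("P1", "R1"), ("P2", "R2"), ("P3", "R3")])
```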
You can do that, but you don't want to. It will result in a table scan and be extremely slow. You'd be much better served by firing off each request separately.
If you're dead set on doing it with one query, comment here and I'll get you the code. I'm on my iPad right now so I don't have it handy.
I am stuck on a design problem. I want to assign ranks to user records in a table. Users perform some action on the site and are given a rank on a leaderboard based on it. The selects I want to run could be Top 10, a user's position, Top 10 logged in today, etc.
I just cannot find a way to store this in an Azure table. Then I thought about storing a custom collection object (a sorted list) in a blob.
Any suggestions?
Table entities are sorted by PartitionKey, then RowKey. While you could continually delete and recreate users (thus allowing you to change the PK and RK) to give the correct order, it seems like a bad idea, or at least overkill. Instead, I would probably store the data that you use to compute the rankings, and periodically compute and store the rankings (as you say). We do this a lot in our work: pre-compute what the data should look like as a JSON view, store it in a blob, and let the UI query it directly. The trick is to decide when to re-compute the view. After a user does something that would cause the rankings to change, I would probably queue a message and let a worker process go and re-compute the view. This prevents too many workers from trying to update the data at once.
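A minimal Python sketch of the pre-computed view follows, assuming the raw data is simply a mapping of user name to score. The JSON layout ("top" and "positions") and the tie-breaking rule (by user name) are assumptions; the worker would serialize this string to a blob after picking up the queue message.

```python
import json


def build_leaderboard_view(scores, top_n=10):
    """Rank users by score (highest first) and return the view as a JSON string."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    rows = [
        {"rank": position, "user": user, "score": score}
        for position, (user, score) in enumerate(ranked, start=1)
    ]
    view = {
        "top": rows[:top_n],                                        # e.g. the Top 10 list
        "positions": {row["user"]: row["rank"] for row in rows},    # a user's position
    }
    return json.dumps(view)


view_json = build_leaderboard_view({"ann": 50, "bob": 70, "cat": 50, "dan": 10}, top_n=2)
view = json.loads(view_json)
```

The UI can then read the top list or look up one user's position from the stored document without touching the table.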