Audit Trail Design using Table Storage - azure

I'm considering implementing an Audit Trail for my application in using Table Storage.
I need to be able to log all actions for a specific customer and all actions for entities from that customer.
My first guess was creating a table for each customer (Audits_CustomerXXX) and use as a partition key the entity id and row key the (DateTime.Max.Ticks - DateTime.Now.Ticks).ToString("D19") value. And this works great when my question is what happened to certain entity? For instance the audit of purchase would have PartitionKey = "Purchases/12345" and the RowKey as the timestamp.
But when I want a birds eye view from the entire customer, can I just query the table sorting by row key across partitions? Or is it better to create a secondary table to hold the data with different partition keys? Also when using the (DateTime.Max.Ticks - DateTime.Now.Ticks).ToString("D19") is there a way to prevent errors when two actions in the same partition happen in the same tick (unlikely but who knows...).
Thanks

You could certainly create a separate table for the birds eye view but you really don't have to. Considering Azure Tables are schema-less, you can keep this data in the same table as well. You would keep the PartitionKey as reverse ticks and RowKey as entity id. Because you would be querying only on PartitionKey, you can also keep RowKey as GUID as well. This will ensure that all entities are unique. Or you could append a GUID to your entity id and use that as RowKey.
However do keep in mind that because you're inserting two entities with different PartitionKey values, you will have to safegaurd your code against possible network failures as each entry will be a separate request to Table service. The way we're handling this in our application is we write this payload to a queue message and then process that message through a background process.

Related

Azure Table Storage data modeling considerations

I have a list of users. A user can either login either using username or e-mail address.
As a beginner in azure table storage, this is what I do for the data model for fast index scan.
PartitionKey RowKey Property
users:email jacky#email.com nickname:jack123
users:username jack123 email:jacky#email.com
So when a user logs in via email, I would supply PartitionKey eq users:email in the azure table query. If it is username, Partition eq users:username.
Since it doesn't seem possible to simulate contains or like in azure table query, I'm wondering if this is a normal practice to store multiple row of data for 1 user ?
Since it doesn't seem possible to simulate contains or like in azure
table query, I'm wondering if this is a normal practice to store
multiple row of data for 1 user ?Since it doesn't seem possible to
simulate contains or like in azure table query, I'm wondering if this
is a normal practice to store multiple row of data for 1 user ?
This is a perfectly valid practice and in fact is a recommended practice. Essentially you will have to identify the attributes on which you could potentially query your table storage and somehow use them as a combination of PartitionKey and RowKey.
Please see Guidelines for table design for more information. From this link:
Consider storing duplicate copies of entities. Table storage is cheap so consider storing the same entity multiple times (with
different keys) to enable more efficient queries.

Cassandra changing Primary Key vs Firing multiple select queries

I have a table that stores list products that a user has. The table looks like this.
create table my_keyspace.userproducts{
userid,
username,
productid,
productname,
producttype,
Primary Key(userid)
}
All users belong to a group, there could be min 1 to max 100 users in a group
userid|groupid|groupname|
1 |g1 | grp1
2 |g2 | grp2
3 |g3 | grp3
We have new requirement to display all products for all users in a single group.
So do i change my userproducts so that my Partition Key is now groupid and make userid as my cluster key, so that i get all my results in one single query.
Or do I keep my table design as it is and fire multiple select queries by selecting all users in a group from second table and then fire one select query for each user, consolidate data in my code and then return it to the users
Thanks.
Even before getting to your question, your data modelling as you presented it has a problem: You say that you want to store "a list products that a user has". But this is not what the table you presented has - your table has a single product for each userid. The "userid" is the key of your table, and each entry in the table, i.e, each unique userid, has one combination of the other fields.
If you really want each user to have a list of products, you need the primary key to be (userid, productid). This means that each record is indexed by both a userid and a productid, or in other words - a userid has a list of records each with its own productid. Cassandra allows you to efficiently fetch all the productid records for a single userid because it implements the first part of the key as a "partition key" but the second part is a "clustering key".
Regarding your actual question, you indeed have two options: Either do multiple queries on your original tables, or do so-called denormalization, i.e., create a second table with exactly what you want searchable immediately. For the second option you can either do it manually (update both tables every time you have new data), or let Cassandra update the second table for you automatically, using a feature called Materialized Views.
Which of the two options - multiple queries or multiple updates - to use really depends on your workload. If it has many updates and rare queries, it is better to leave updates quick and make queries slower. If, on the other hand, it has few updates but many queries, it is better to make updates slower (when each update needs to update both tables) but make queries faster. Another important issue is how much query latency is important for you - the multiple queries option not only increases the load on the cluster (which you can solve by throwing more hardware at the problem) but also increases the latency - a problem which does not go away with more hardware and for some use cases may become a problem.
You can also achieve a similar goal in Cassandra by using the Secondary Index feature, which has its own performance characteristics (in some respects it is similar to the "multiple queries" solution).

nosql separate data by client

I have to develop a project using a NoSql base, either couchbase or cassandra.
I would like to know if it is recommended to partition the data of each customer in a bucket?
In my case, there will never be requests between the different clients.
The data can be completely separated.
For couchbase, I saw that for each bucket a memory capacity, was reserved for him.
Where does the separation have to be done at another place document or super column for cassandra.
Thank you
Where does the separation have to be done at another place document or super column for cassandra.
Tip #1, when working with Cassandra, completely erase the word "super column" from your vocabulary.
I would like to know if it is recommended to partition the data of each customer in a bucket?
That depends. It sounds like your queries would be mostly based on a customer id, so it makes sense to have it as a part of your partition key. However, if each customer partition has millions of rows and/or columns underneath it, that's going to get very big.
Tip #2, proper Cassandra modeling is done based on what your required queries look like. So without actually seeing the kinds of queries you need to serve, it's going to be difficult to be any more specific than that.
If you have customer data relating to accounts and addresses, etc, then building a customers table with a PRIMARY KEY of only customer_id might make sense. But if you find that you need to query your customers (for example) by email_address, then you'll want to create a customers_by_email table, duplicate your data into that, and create a PRIMARY KEY that supports that.
Additionally, if you find yourself storing data on customer activity, you may want to consider a customer_activity table with a PRIMARY KEY of PRIMARY KEY ((customer_id,month),activity_time). That will use both customer_id and month as a partition key, storing the customer's activity clustered by activity_time. In this case, if we didn't use month as an additional partition key, each customer_id partition would be continually written to, until it became too ungainly to write to or query (unbound row growth).
Summary:
If anyone tells you to use a super column in Cassandra, slap them.
You need to know your queries before you design your tables.
Yes, customer_id would be a good way to keep your data separate and ensure that each query is restricted to a single node.
-Build your partition keys to account for unbound row growth, to save you from writing too much data into the same partition.

Regarding Azure table design

I am working as freelancer and right now working on one of my game and trying to use Azure table service to log my user moves in Azure tables.
The game is based on Cards.
The flow is like this:
Many users(UserId) will be playing on a table(TableId). Each game on the table will have a unique GameId. In each game there could be multiple deals with Unique DealId.
There can be multiple deals on the same table with same gameId. Also each user will have same DealId in a single game.
Winner is decided after multiple chances of a player.
Problem:
I can make TableId as PartitionKey and but I am not sure what to chose for RowKey because combination of TableId and RowKey (GameId/UserId/DealId) should be unique in the table.
I can have entries like:
TableId GameId DealId UserId timestamp
1 201 300 12345
1 201 300 12567
May be what I can do is to create 4 Azure tables like below but I am doing a lot of duplication; also I would not be able to fire a a point query as mentioned here at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
Format of TableId,GameId,DealId and UserId is long.
I would like to query data such that
Get me all the logs from a TableId.
Get me all the logs from a TableId and in a particular game(GameId)
Get me all the logs of a user(userid) in this game(GameId)
Get me all the logs of a user in a deal(dealId)
Get me all the logs from a table on a date; similarly for a user,game and deal
Based on my knowledge so far on Azure Tables, I believe you're on right track.
However there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need to use separate tables for storing each kind of data though this approach logically separates the data nicely. If you want, you could possibly store them in a single table. If you go with single table, since these ids (Game, Table, User, and Deal) are numbers what I would recommend is to prefix the value appropriately so that you can nicely identify them. For example, when specifying PartitionKey denoting a Game Id, you can prefix the value with G| so that you know it's the Game Id e.g. G|101.
Pre-pad your Id values with 0 to make them equal length string
You mentioned that your id values are long. However the PartitionKey value is of string type. I would recommend prepadding the values so that they are of equal length. For example, when storing Game Id as PartitionKey instead of storing them as 1, 2, 103 etc. store them as 00000000001, 00000000002, 00000000103. This way when you list all Ids, they will be sorted in proper order. Without prepadding, you will get the results as 1, 10, 11, 12....19, 20.
You will loose transaction support
Since you're using multiple tables (or even single table with different PartitionKeys), you will not be able to use Entity Batch Transactions available in Azure Tables and all the inserts need to be done as atomic operations. Since each operation is a network call and can possibly fail, you may want to do that through an idempotent background process which will keep on trying inserting the data into multiple tables till the time it succeeds.
Instead of Guid for RowKey, I suggest you create a composite RowKey based on other values
This is more applicable for update scenario. Since an update requires both PartitionKey and RowKey, I would recommend using a RowKey which is created as a composition of other values. For example, if you're using TableId as PartitionKey for GameLogsByTableId, I would suggest creating a RowKey using other values e.g. U|[UserId]|D|[DealId]|G|[GameId]. This way, when you get a record to update, you automatically know how to create a RowKey instead of fetching the data first from the table.
Partition Scans
I looked at your querying requirements and almost all of them would result in Partition Scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements. In this case, you will need to scan the entire partition for a user to find information about a Game Id and Deal Id. So please be prepared for the scenario where table service returns you nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use table storage for this. It will make your job much harder than using an SQL database; you can use any index you like, have relational integrity, and so much more. The only thing in favour of ATS is that it's cheap for large data.

What is the disadvantage to unique partition keys?

My data set will only ever be directly queried (meaning I am looking up a specific item by some identifier) or will be queried in full (meaning return every item in the table). Given that, is there any reason to not use a unique partition key?
From what I have read (e.g.: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#choosing-an-appropriate-partitionkey) the advantage of a non-unique partition key is being able to do transactional updates. I don't need transactional updates in this data set so is there any reason to partition by anything other than some unique thing (e.g., GUID)?
Assuming I go with a unique partition key per item, this means that each partition will have one row in it. Should I repeat the partition key in the row key or should I just have an empty string for a row key? Is a null row key allowed?
Zhaoxing's answer is essentially correct but I want to expand on it so you can understand a bit more why.
A table partition is defined as the table name plus the partition key. A single server can have many partitions, but a partition can only ever be on one server.
This fundamental design means that access to entities stored in a single partition cannot be load-balanced because partitions support atomic batch transactions. For this reason, the scalability target for an individual table partition is lower than for the table service as a whole. Spreading entities across many partitions allows Azure storage to scale your load much better.
Point queries are optimal which is great because it sounds like that's what you will be doing a lot of. If partition key has no logical meaning (ie, you won't want all the entities in a particular partition) you're best splitting out to many partition keys. Listing all entities in a table will always be slower because it's a scan. Azure storage will return continuation tokens if we hit timeout, 1000 entities, or a server boundary (as discussed above). Many of the storage client libraries have convenience methods which should help you handle this by automatically following these tokens as you iterate through the list.
TL;DR: With the information you've given I'd recommend a unique partition key per item. Null row keys are not allowed, but however else you'd like to construct the row key is fine.
Reading:
Azure Storage Table Design Guide
Azure Storage Performance Check List
If you don't need EntityGroupTransaction to update entities in batch, unique partition keys are good option to you.
Table service auto-scale feature may not work perfectly I think. When some of data in a partition are 'hot', table service will move them to another cluster to enhance performance. But since you have unique partition key, probably non of your entity will be determined as 'hot', while if you grouped them in partitions some partition will be 'hot' and moved. This problem below may also be there if you are using static partition key.
Besides, table service may returns partial entities of your query when
More than 1000 entities in result.
Partition boundary is crossed.
From your request you also need full query (return all entities). If your are using unique partition key this mean each entity is a unique partition, so your query will only return 1 entity with a continue token. And you need to fire another query with this continue token to retrieve the next entity. I don't think this is what you want.
So my suggestion is, select a reasonable partition key in any cases, even though it looks useless in your business, because it helps table service to optimize your data.

Resources