Setting RowKey in Azure Table Storage

I am storing some logs in Azure Table Storage. I've identified the PartitionKey I should use; however, I'm having trouble determining what I should use for the RowKey. If I were using SQL Server, I would use an auto-incrementing integer, but from what I can tell an auto-generated RowKey is not an option with Azure Table Storage. I'm fine with using a GUID, but everyone seems to warn against using a GUID, and I'm not sure what I should be using instead.
Can anyone provide me a pointer on what I should use as the RowKey for storing log data? I've seen the following syntax (RowKey: {'_': '1'}), as shown below, but can't figure out what it means:
var task = {
    PartitionKey: {'_': 'hometasks'},
    RowKey: {'_': '1'}
};
Thanks!

There are many approaches you can take. One such approach would be to store the date/time value in ticks as the RowKey. This would help you fetch log data for a particular time range. Just remember that since RowKey is of the String data type, you will want to pre-pad it with zeros so that all values are of the same length. For example,
DateTime.UtcNow.Ticks.ToString("d20")
With this, you could take two approaches:
Store them in chronological order, as shown in the example above.
Store them in reverse chronological order. The advantage of this approach is that the latest entries will always be added at the top of the table, so you can simply query the table on PartitionKey, take the top 'x' rows, and they will be the latest. You would do something like:
(DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d20")
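For illustration, here is a minimal sketch of writing a log entity with a reverse-chronological RowKey using the classic Microsoft.WindowsAzure.Storage table SDK; the table name "logs", the partition "MyApiLogs" and the Message property are assumptions for the example, not something from the question:
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

var account = CloudStorageAccount.Parse("<connection string>");
var table = account.CreateCloudTableClient().GetTableReference("logs");
table.CreateIfNotExists();

// Reverse-chronological RowKey: the newest entry sorts first within the partition.
string rowKey = (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d20");

var entity = new DynamicTableEntity("MyApiLogs", rowKey);
entity["Message"] = new EntityProperty("Something happened");
table.Execute(TableOperation.Insert(entity));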
You also mentioned in your comment that "I expect the data sets to be quite large". I hope you are not using a single PartitionKey, because if the number of records is quite large and all of them go into the same partition, performance may be impacted.

Related

What is the best way of creating Partition key in azure table storage for storing sensor output data?

I searched for best practices for storing sensor output data in Azure Table Storage but didn't find a good answer. I am currently working on a project that involves storing sensor data in Azure Table Storage. Currently I am using the sensor ID as the partition key, and I store the sensor outputs every second. About 100 sensors are currently in use, so imagine how much data is stored every day. I am getting slow performance in my web application when I search for a particular sensor's data by date. Is there a better way to improve the performance of the web app? How about changing the partition key from sensor ID to date? Code is not important here; I need a logical solution. Maybe this question will help a lot of developers who are working on such a scenario.
UPDATE
Each sensor provides 10 different outputs plus a date, which is the output datetime; they are stored in the same row for each sensor ID. I retrieve sensor data using a date range and a sensor ID.
PartitionKey - sensor ID, RowKey - datetime, plus 10 output columns and the output date
here is my code
// Filter on PartitionKey (sensor id)
var query = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sensorID);

// Filter on the (non-indexed) outputdate property
var dateFilter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterConditionForDate("outputdate", QueryComparisons.GreaterThanOrEqual, Convert.ToDateTime(from)),
    TableOperators.And,
    TableQuery.GenerateFilterConditionForDate("outputdate", QueryComparisons.LessThanOrEqual, Convert.ToDateTime(to)));

query = TableQuery.CombineFilters(query, TableOperators.And, dateFilter);

var rangeQuery = new TableQuery<TotalizerTableEntity>().Where(query);
var entities = table.ExecuteQuery(rangeQuery).OrderBy(j => j.date).ToList();
outputdate indicates the time the output was generated; it is stored as a datetime. All outputs in a row have the same output time.
First, I would highly recommend that you read Azure Storage Table Design Guide: Designing Scalable and Performant Tables. This will give you a lot of ideas about how to structure your data.
Now coming to your current implementation. What I am noticing is that you're including PartitionKey in your query (which is very good, BTW) but then adding a non-indexed attribute (outputdate) to your query as well. This results in what is known as a partition scan. For larger tables, this becomes a problem because your query will be scanning the entire partition for matching outputdate values.
You mentioned that you're storing the datetime value as the RowKey. Assuming the RowKey value matches the value of the output date, I would recommend using RowKey in your query instead of this non-indexed attribute. RowKey and PartitionKey are the only two attributes that are indexed in a table, so the query will be comparatively much faster.
When saving the date/time as the RowKey, I would recommend converting it into ticks (DateTime.Ticks) and saving that instead of simply converting the date/time value to a string. If you go with this approach, I would suggest pre-padding the ticks with zeros so that all values are of the same length (doing something like DateTime.Ticks.ToString("d19")).
You can also save the RowKey as reverse ticks, i.e. (DateTime.MaxValue.Ticks - DateTime.Ticks).ToString("d20"). This ensures that the latest entries get added at the top of the table instead of at the bottom, which helps in scenarios where you are more interested in querying the latest records.
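As a rough illustration, here is a minimal sketch of such a RowKey range query with the same SDK used in the question, assuming the RowKey stores the output date's ticks zero-padded with ToString("d19") in chronological order (variable names follow the question's code):
// PartitionKey = sensor id, RowKey between two padded tick values; both filters hit the index.
string fromTicks = Convert.ToDateTime(from).Ticks.ToString("d19");
string toTicks = Convert.ToDateTime(to).Ticks.ToString("d19");

string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sensorID),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, fromTicks),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, toTicks)));

var rangeQuery = new TableQuery<TotalizerTableEntity>().Where(filter);
var entities = table.ExecuteQuery(rangeQuery).OrderBy(e => e.date).ToList();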
If you will always query for a particular sensor, it may not hurt to save the data for each sensor in a separate table, i.e. each sensor gets its own table. This frees up one key for you: you can use the date/time value (which you're currently storing as the RowKey) as the PartitionKey and use some other value as the RowKey. Furthermore, it allows you to scale across storage accounts - data for some sensors can go in one storage account while data for other sensors goes in another. You just need to store this mapping somewhere so that the data reaches the correct storage account/table.

Unreliable performance in Azure Table Storage when joining point queries together

I am having a consistent problem with the performance of Azure Table Storage. I'm querying a table which holds user accounts. The table stores the userId in both the PartitionKey and RowKey so I can easily make point queries.
My issue is that in several cases I need to retrieve multiple users in a single query. To achieve that I have a class which builds filter strings for me. The manner in which this works is not related to the problem; however, this is an example of the output:
(PartitionKey eq '00540de6-dd2b-469f-8730-e7800e06ccc0' and RowKey eq '00540de6-dd2b-469f-8730-e7800e06ccc0') or
(PartitionKey eq '02aa11b7-974a-4ee9-9a8e-5fc09970bb99' and RowKey eq '02aa11b7-974a-4ee9-9a8e-5fc09970bb99') or
(PartitionKey eq '040aec50-ebcd-4e5d-8f58-82aea616bd82' and RowKey eq '040aec50-ebcd-4e5d-8f58-82aea616bd82') or
// up to 22 more (25 total)
On first execution the query takes a long time, between 2 and 5 seconds, and is missing data, which leads to errors. When run a second time the query takes between 0.2 and 0.5 seconds to complete and returns all the data.
Note that I also tried supplying just the PartitionKey, but it made no difference. I had assumed that a point query would perform better.
From this behaviour I can only presume it's caused by the data being 'cold' when first requested and then being pulled from a 'hot' cache on successive requests.
If this is the case, how can I change the filter string to improve performance? Alternatively, how can I change the timeout of the table storage query to give it more time to complete? Is it possible to increase the scaling of my table storage?
Please don't use point queries concatenated with 'or'; Azure Table Storage cannot treat such a filter as multiple point queries. Instead, it will be treated as a full table scan, which is terrible for performance. You should execute the 25 point queries separately (for example, in parallel) to improve performance.
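For example, here is a minimal sketch of running the point queries in parallel with the classic table SDK; it assumes an async context, and the UserEntity type and the userIds collection are placeholders for illustration:
// One point query (PartitionKey + RowKey) per user, awaited together.
var retrieves = userIds.Select(id =>
    table.ExecuteAsync(TableOperation.Retrieve<UserEntity>(id, id)));

var results = await Task.WhenAll(retrieves);
var users = results
    .Where(r => r.Result != null)
    .Select(r => (UserEntity)r.Result)
    .ToList();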

partitionkey and rowkey in azure table storage

I understand the benefit of a partition key in azure table storage. However, given my relational database background, I am a bit confused about how to retrieve an entity from azure table storage given just the rowkey. As far as I know, this is impossible. This means that I have to store the partition key/rowkey pair somewhere to just get the entity given the rowkey. Should I just introduce a 'sharding' table with one arbitrary partition key, which allows me to look up the partition key given the rowkey?
It is possible but will result in a table scan as described in this section of MSDN.
If you don't need multiple partitions then it is absolutely fine to use a single partition (e.g. using a constant), provided your data isn't going to be enormous in size and won't need the scalability of multiple partitions.
Another possible approach is to use your current RowKey as PartitionKey which would give you a highly scalable solution but would result in bad performance if you need to query ranges of rows.
The linked MSDN page talks about the pros and cons of both so I think with your knowledge about your specific problem domain you should be able to find a balanced solution.
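For reference, here is a minimal sketch of what such a RowKey-only lookup looks like with the classic table SDK; because no PartitionKey is supplied, the service has to scan every partition (the MyEntity type and rowKey variable are placeholders):
// Filtering on RowKey alone forces a full table scan across all partitions.
var scan = new TableQuery<MyEntity>().Where(
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, rowKey));

var match = table.ExecuteQuery(scan).FirstOrDefault();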

Regarding Azure table design

I am working as a freelancer, and right now I am working on one of my games and trying to use the Azure Table service to log user moves in Azure tables.
The game is based on Cards.
The flow is like this:
Many users (UserId) will be playing on a table (TableId). Each game on the table will have a unique GameId. In each game there can be multiple deals, each with a unique DealId.
There can be multiple deals on the same table with the same GameId. Also, each user will have the same DealId within a single game.
The winner is decided after a player has had multiple chances.
Problem:
I can make TableId the PartitionKey, but I am not sure what to choose for the RowKey, because the combination of TableId and RowKey (GameId/UserId/DealId) should be unique in the table.
I can have entries like:
TableId  GameId  DealId  UserId  timestamp
1        201     300     12345
1        201     300     12567
Maybe what I can do is create 4 Azure tables as below, but that means a lot of duplication; also I would not be able to fire a point query as mentioned in the guide at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
The format of TableId, GameId, DealId and UserId is long.
I would like to query data such that
Get me all the logs from a TableId.
Get me all the logs from a TableId and in a particular game(GameId)
Get me all the logs of a user(userid) in this game(GameId)
Get me all the logs of a user in a deal(dealId)
Get me all the logs from a table on a date; similarly for a user, game and deal
Based on my knowledge so far on Azure Tables, I believe you're on right track.
However there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need to use separate tables for each kind of data, though this approach does separate the data logically and nicely. If you want, you could store everything in a single table. If you go with a single table, then since these ids (Game, Table, User, and Deal) are numbers, I would recommend prefixing each value appropriately so that you can identify it easily. For example, when a PartitionKey denotes a Game Id, prefix the value with G| so that you know it's a Game Id, e.g. G|101.
Pre-pad your Id values with 0 to make them equal-length strings
You mentioned that your id values are long; however, the PartitionKey value is of string type. I would recommend pre-padding the values so that they are of equal length. For example, when storing a Game Id as the PartitionKey, instead of storing 1, 2, 103, etc., store 00000000001, 00000000002, 00000000103. This way, when you list all the Ids, they will be sorted in the proper order; without pre-padding, you would get results like 1, 10, 11, 12, ..., 19, 2, 20. A sketch combining the prefix and the padding is shown below.
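For illustration, here is a minimal sketch of building such prefixed, zero-padded key values; the MakeKey helper name and the 11-digit padding width are assumptions, not part of the original answer:
// Hypothetical helper: the prefix identifies the id type, D11 pads the long to 11 digits
// so that string (lexicographic) order matches numeric order.
static string MakeKey(string prefix, long id) => prefix + "|" + id.ToString("D11");

string gamePartitionKey = MakeKey("G", 101);   // "G|00000000101"
string userPartitionKey = MakeKey("U", 12345); // "U|00000012345"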
You will lose transaction support
Since you're using multiple tables (or even a single table with different PartitionKeys), you will not be able to use the Entity Batch Transactions available in Azure Tables, and all the inserts will need to be done as individual operations. Since each operation is a network call and can possibly fail, you may want to do the writes through an idempotent background process that keeps retrying until the data has been inserted into all the tables.
Instead of a Guid for the RowKey, I suggest you create a composite RowKey based on the other values
This is more applicable to the update scenario. Since an update requires both PartitionKey and RowKey, I would recommend using a RowKey that is composed of the other values. For example, if you're using TableId as the PartitionKey for GameLogsByTableId, I would suggest creating the RowKey from the other values, e.g. U|[UserId]|D|[DealId]|G|[GameId]. This way, when you get a record to update, you automatically know how to construct the RowKey instead of having to fetch the data from the table first. A sketch is shown below.
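A minimal sketch of composing such a RowKey, reusing the hypothetical MakeKey helper from the sketch above:
// Composite RowKey for GameLogsByTableId, derived entirely from the ids,
// so it can be reconstructed for an update without a lookup.
string rowKey = string.Join("|",
    MakeKey("U", userId),
    MakeKey("D", dealId),
    MakeKey("G", gameId));
// e.g. "U|00000012345|D|00000000300|G|00000000201"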
Partition Scans
I looked at your querying requirements, and almost all of them would result in partition scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements: in those cases you would need to scan an entire user partition to find information about a Game Id and Deal Id. So please be prepared for the scenario where the table service returns nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use Table Storage for this. It will make your job much harder than using an SQL database, where you can use any index you like, have relational integrity, and so much more. The only thing in favour of ATS is that it's cheap for large amounts of data.

Strategy for storing application logs in Azure Table Storage

I am trying to determine a good strategy for storing logging information in Azure Table Storage. I have the following:
PartitionKey: The name of the log.
RowKey: Inverted DateTime ticks.
The only issue here is that partitions could get very large (millions of entities) and the size will increase with time.
But that being said, the type of queries being performed will always include the PartitionKey (no scanning) AND a RowKey filter (a minor scan).
For example (in a natural language):
where `PartitionKey` = "MyApiLogs" and
where `RowKey` is between "01-01-15 12:00" and "01-01-15 13:00"
Provided that the query is done on both PartitionKey and RowKey, I understand that the size of the partition doesn't matter.
Take a look at our new Table Design Patterns Guide - specifically the log data anti-pattern, as it talks about this scenario and alternatives. Often when people write log files they use a date for the PartitionKey, which results in a hot partition because all writes go to a single partition. Quite often blobs end up being a better destination for log data, as people typically end up processing the logs in batches anyway; the guide discusses this as an option.
Adding my own answer so people can have something inline without needing external links.
You want the partition key to be the timestamp plus the hash code of the message. This is good enough in most cases. If you want, you can also fold the hash code(s) of any additional key/value pairs into the message's hash code, but I've found it's not really necessary.
Example:
string partitionKey = DateTime.UtcNow.ToString("o").Trim('Z', '0') + "_" + ((uint)message.GetHashCode()).ToString("X");
string rowKey = logLevel.ToString();
DynamicTableEntity entity = new DynamicTableEntity { PartitionKey = partitionKey, RowKey = rowKey };
// add any additional key/value pairs from the log call to the entity, i.e. entity["key"] = value;
// use InsertOrMerge to add the entity
When querying logs, use a query whose partition key marks the start of the window you want to retrieve, usually something like 1 minute or 1 hour back from the current date/time. You can then page backwards another minute or hour with a different date/time stamp. This avoids the weird date/time hack of subtracting the timestamp from DateTime.MaxValue.
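As a rough sketch of that paging idea with the classic table SDK, assuming the PartitionKey is built from DateTime.UtcNow.ToString("o") as in the snippet above (the one-hour window and the table variable are illustrative assumptions):
// ISO-8601 ("o") timestamps sort lexicographically, so a PartitionKey range covers a time window.
// Repeat with an earlier window to page further back.
DateTime start = DateTime.UtcNow.AddHours(-1);
DateTime end = DateTime.UtcNow;

string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, start.ToString("o").Trim('Z', '0')),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, end.ToString("o").Trim('Z', '0')));

foreach (DynamicTableEntity log in table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)))
{
    // read values back, e.g. log.Properties["key"]
}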
If you get extra fancy and put a search service on top of the Azure table storage, then you can lookup key/value pairs quickly.
This will be much cheaper than Application Insights if you are using Azure Functions; in that case I would suggest disabling Application Insights. If you need multiple log names, just add another table.
