Strategy for storing application logs in Azure Table Storage

I am trying to determine a good strategy for storing logging information in Azure Table Storage. I have the following:
PartitionKey: the name of the log.
RowKey: inverted DateTime ticks.
The only issue here is that partitions could get very large (millions of entities), and the size will keep increasing over time.
But that being said, the type of queries being performed will always include the PartitionKey (no scanning) AND a RowKey filter (a minor scan).
For example (in a natural language):
where `PartitionKey` = "MyApiLogs" and
where `RowKey` is between "01-01-15 12:00" and "01-01-15 13:00"
Provided that the query is done on both PartitionKey and RowKey, I understand that the size of the partition doesn't matter.
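For concreteness, a minimal sketch of what this scheme could look like (logTime, fromTime, toTime, and table are illustrative names, not from the original post; table is assumed to be a CloudTable reference):
string partitionKey = "MyApiLogs";
string rowKey = (DateTime.MaxValue.Ticks - logTime.Ticks).ToString("d19"); // inverted ticks, newest entries sort first
// Query: PartitionKey = "MyApiLogs" and RowKey between the inverted ticks of the two bounds.
// Note that with inverted ticks the later bound (toTime) becomes the smaller RowKey.
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "MyApiLogs"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, (DateTime.MaxValue.Ticks - toTime.Ticks).ToString("d19")),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, (DateTime.MaxValue.Ticks - fromTime.Ticks).ToString("d19"))));
var logs = table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter));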

Take a look at our new Table Design Patterns Guide, specifically the log-data anti-pattern, as it talks about this scenario and some alternatives. Often when people write log entries they use a date for the PartitionKey, which results in a hot partition because all writes go to a single partition. Blobs often end up being a better destination for log data, since people typically process logs in batches anyway; the guide discusses this as an option.

Adding my own answer so people can have something inline without needing external links.
You want the partition key to be the timestamp plus the hash code of the message. This is good enough in most cases. If you want, you can also combine the hash code of the message with the hash code(s) of any additional key/value pairs, but I've found it's not really necessary.
Example:
string partitionKey = DateTime.UtcNow.ToString("o").Trim('Z', '0') + "_" + ((uint)message.GetHashCode()).ToString("X"); // ISO-8601 timestamp (trailing 'Z' and zeros trimmed) + hex hash of the message
string rowKey = logLevel.ToString();
DynamicTableEntity entity = new DynamicTableEntity { PartitionKey = partitionKey, RowKey = rowKey };
// add any additional key/value pairs from the log call to the entity, i.e. entity["key"] = value;
// use InsertOrMerge to add the entity
When querying logs, you can use a query whose partition key marks the start of the window you want to retrieve, usually something like 1 minute or 1 hour back from the current date/time. You can then page backwards another minute or hour using a different timestamp. This avoids the awkward hack of subtracting the timestamp from DateTime.MaxValue.
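As a minimal sketch of that kind of query (logTable is an assumed CloudTable reference, and the one-hour window is just an example):
DateTime windowEnd = DateTime.UtcNow;
DateTime windowStart = windowEnd.AddHours(-1);
// Format the bounds the same way the stored PartitionKeys are formatted above.
string lower = windowStart.ToString("o").Trim('Z', '0');
string upper = windowEnd.ToString("o").Trim('Z', '0');
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, lower),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, upper));
foreach (DynamicTableEntity logEntry in logTable.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)))
{
    // logEntry.RowKey holds the log level; additional key/value pairs are in logEntry.Properties
}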
If you get extra fancy and put a search service on top of the Azure table storage, then you can lookup key/value pairs quickly.
This will be much cheaper than Application Insights if you are using Azure Functions (whose built-in Application Insights integration I would suggest disabling). If you need multiple log names, just add another table.

Related

What is the best way of creating Partition key in azure table storage for storing sensor output data?

I searched for best practices for storing sensor output data in Azure Table Storage but didn't find a good answer. I am currently working on a project that involves storing sensor data in Azure Table Storage. Currently I am using the sensor ID as the partition key, and I store the sensor outputs every second. About 100 sensors are currently in use, so a large amount of data is stored every day. I am getting slow performance in my web application when I search for a particular sensor's data by date. Is there a better way to improve the performance of the web app? How about changing the partition key from sensor ID to date? Code is not important here; I need a logical solution. Maybe this question will help a lot of developers working on such a scenario.
UPDATE
Each sensor provides 10 different outputs plus a date, which is the output datetime; these all go in the same row for each sensor ID. I retrieve sensor data using a date range and a sensor ID.
PartitionKey: sensor ID, RowKey: datetime, plus 10 output columns and the output date.
here is my code
var query = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sensorID);
var dateFilter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterConditionForDate("outputdate", QueryComparisons.GreaterThanOrEqual, Convert.ToDateTime(from)),
    TableOperators.And,
    TableQuery.GenerateFilterConditionForDate("outputdate", QueryComparisons.LessThanOrEqual, Convert.ToDateTime(to)));
query = TableQuery.CombineFilters(query, TableOperators.And, dateFilter);
var rangeQuery = new TableQuery<TotalizerTableEntity>().Where(query);
var entities = table.ExecuteQuery(rangeQuery).OrderBy(j => j.date).ToList();
outputdate indicates when the output was generated; it is stored as a DateTime. All outputs in a row have the same output time.
First, I would highly recommend that you read Azure Storage Table Design Guide: Designing Scalable and Performant Tables. This will give you a lot of ideas about how to structure your data.
Now, coming to your current implementation: what I am noticing is that you're including PartitionKey in your query (which is very good, BTW) but then adding a non-indexed attribute (outputdate) to your query as well. This will result in what is known as a partition scan. For larger tables this will create a problem, because your query will scan the entire partition for matching outputdate values.
You mentioned that you're storing the datetime value as the RowKey. Assuming the RowKey value matches the value of the output date, I would recommend using the RowKey in your query instead of this non-indexed attribute. RowKey and PartitionKey are the only two attributes that are indexed in a table, so the query will be comparatively much faster.
When saving date/time as the RowKey, I would recommend converting it into ticks (DateTime.Ticks) and saving that, instead of simply converting the date/time value to a string. If you go with this approach, I would suggest pre-padding the ticks with zeros so that all values are of the same length (i.e. something like DateTime.Ticks.ToString("d19")).
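For illustration, a minimal sketch of the question's query rewritten against a ticks-based RowKey (this assumes the RowKey was written with DateTime.Ticks.ToString("d19"), and reuses the sensorID, from, to, and table variables from the code above):
string pkFilter = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sensorID);
string rkFilter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, Convert.ToDateTime(from).Ticks.ToString("d19")),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, Convert.ToDateTime(to).Ticks.ToString("d19")));
var rangeQuery = new TableQuery<TotalizerTableEntity>()
    .Where(TableQuery.CombineFilters(pkFilter, TableOperators.And, rkFilter));
var entities = table.ExecuteQuery(rangeQuery).ToList(); // both filters hit indexed columns, so no partition scan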
You can also save the RowKey as reverse ticks, i.e. (DateTime.MaxValue.Ticks - DateTime.Ticks).ToString("d20"). This will ensure that the latest entries get added at the top of the table instead of at the bottom, which helps in scenarios where you are more interested in querying the latest records.
If you will always query for a particular sensor, it may not hurt to save the data for each sensor in a separate table, i.e. each sensor gets its own table. This will free up one key for you: you can use the date/time value (which you're currently storing as the RowKey) as the PartitionKey and use some other value as the RowKey. Furthermore, it will allow you to scale across storage accounts - data for some sensors can go in one storage account while data for other sensors goes in another. You just need to store this relationship somewhere so that the data reaches the correct storage account/table.

Regarding Azure table design

I am working as a freelancer and right now I am working on one of my games, trying to use the Azure Table service to log user moves in Azure tables.
The game is based on Cards.
The flow is like this:
Many users (UserId) will be playing at a table (TableId). Each game on the table will have a unique GameId. In each game there could be multiple deals, each with a unique DealId.
There can be multiple deals on the same table with the same GameId. Also, each user will have the same DealId within a single game.
Winner is decided after multiple chances of a player.
Problem:
I can make TableId the PartitionKey, but I am not sure what to choose for the RowKey, because the combination of TableId and RowKey (GameId/UserId/DealId) should be unique in the table.
I can have entries like:
TableId  GameId  DealId  UserId  timestamp
1        201     300     12345
1        201     300     12567
Maybe what I can do is create 4 Azure tables like the ones below, but then I am doing a lot of duplication; also, I would not be able to fire a point query, as mentioned in the guidelines at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
TableId, GameId, DealId, and UserId are all of type long.
I would like to query data such that:
Get me all the logs from a TableId.
Get me all the logs from a TableId and in a particular game(GameId)
Get me all the logs of a user(userid) in this game(GameId)
Get me all the logs of a user in a deal(dealId)
Get me all the logs from a table on a date; similarly for a user, game, and deal
Based on my knowledge of Azure Tables so far, I believe you're on the right track.
However, there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need to use separate tables for storing each kind of data, though this approach logically separates the data nicely. If you want, you could store everything in a single table. If you go with a single table, then since these Ids (Game, Table, User, and Deal) are numbers, I would recommend prefixing each value appropriately so that you can easily identify it. For example, when a PartitionKey denotes a Game Id, prefix the value with G| so that you know it's a Game Id, e.g. G|101.
Pre-pad your Id values with 0 to make them equal length string
You mentioned that your Id values are long. However, the PartitionKey value is of string type, so I would recommend pre-padding the values so that they are all of equal length. For example, when storing a Game Id as the PartitionKey, instead of storing 1, 2, 103, etc., store them as 00000000001, 00000000002, 00000000103. This way, when you list all Ids, they will be sorted in the proper order. Without pre-padding, you would get results like 1, 10, 11, 12, ..., 19, 2, 20.
You will lose transaction support
Since you're using multiple tables (or even a single table with different PartitionKeys), you will not be able to use the entity batch transactions available in Azure Tables, and each insert will have to be done as an individual operation. Since each operation is a network call and can possibly fail, you may want to do the inserts through an idempotent background process that keeps retrying until the data has been written to all the tables.
Instead of Guid for RowKey, I suggest you create a composite RowKey based on other values
This is more applicable to the update scenario. Since an update requires both PartitionKey and RowKey, I would recommend using a RowKey created as a composite of other values. For example, if you're using TableId as the PartitionKey for GameLogsByTableId, I would suggest creating the RowKey from the other values, e.g. U|[UserId]|D|[DealId]|G|[GameId]. This way, when you get a record to update, you automatically know how to construct its RowKey instead of having to fetch the data from the table first.
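A minimal sketch of building such keys (the padding width, sample values, and variable names are illustrative; any fixed width that fits your long values works):
long tableId = 1, gameId = 201, dealId = 300, userId = 12345; // sample values from the question
string partitionKey = tableId.ToString("d19"); // pre-padded so lexical order matches numeric order
string rowKey = $"U|{userId:d19}|D|{dealId:d19}|G|{gameId:d19}"; // composite RowKey, reconstructable at update time
var entity = new DynamicTableEntity(partitionKey, rowKey);
// add the move/log payload as properties, then insert (or use InsertOrReplace when updating)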
Partition Scans
I looked at your querying requirements and almost all of them would result in partition scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements: there you would need to scan an entire user's partition to find the records for a given Game Id and Deal Id. So please be prepared for the scenario where the table service returns you nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use table storage for this. It will make your job much harder than using a SQL database, where you can use any index you like, have relational integrity, and so much more. The only thing in favour of ATS is that it's cheap for large amounts of data.

Setting RowKey in Azure Table Storage

I am storing some logs in Azure Table Storage. I've identified the PartitionKey I should use. However, I'm having trouble determining what I should use for the RowKey. If I was using Sql Server, I would use an auto-incrementing integer. From what I can tell, having an auto-generated RowKey is not an option with Azure Table Storage. I'm fine using a GUID, however, everyone seems to warn against using a GUID. Yet, I'm not sure what I should be using.
Can anyone provide me a pointer for what I should use as the RowKey for storing log data? I've seen the following syntax (RowKey: {'_': '1'}), as shown below, but can't figure out what it means:
var task = {
    PartitionKey: {'_':'hometasks'},
    RowKey: {'_': '1'}
};
Thanks!
There are many approaches you can take. One such approach would be to store the date/time value, in ticks, as the RowKey. This would help you fetch log data for a particular time range. Just remember that since RowKey is of the String data type, you may want to pre-pad it with zeros so that all values are of the same length. For example,
DateTime.UtcNow.Ticks.ToString("d20")
With this, you could take 2 approaches:
Store them in chronological order, as shown in the example above.
Store them in reverse chronological order. The advantage of this approach is that the latest entries will always be added at the top of the table, so you can just query the table on PartitionKey, take the top 'x' rows, and they will be the latest. You would do something like:
(DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d20")
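For example, a minimal sketch of fetching the latest entries with this layout (logTable and the partition name "MyApiLogs" are assumptions for illustration):
var latestQuery = new TableQuery<DynamicTableEntity>()
    .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "MyApiLogs"))
    .Take(100); // reverse-ticks RowKeys sort newest first, so the first 100 rows are the latest
var latest = logTable.ExecuteQuerySegmented(latestQuery, null).Results;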
You also mentioned in your comment that "I expect the data sets to be quite large". I hope you are not using a single PartitionKey, because if the number of records is quite large and all of them are put into the same partition, performance might be impacted.

HBase schema design in storing query log

Recently I've been working on a solution for storing users' search/query logs in an HBase table.
Here is a simplified view of the raw query log:
query timestamp req_cookie req_ip ...
Data access patterns:
scan through all queries within a time range.
scan through all search history for a specified query
I came up with the following row-key design:
<query>_<timestamp>
But the query may be very long or in a different encoding, so putting the query directly into the row key seems unwise.
I'm looking for help optimizing this schema; has anybody handled this scenario before?
1- You can do a full table scan with a time range. If you need realtime responses you will have to maintain a reverse row-key table <timestamp>_<query> (plan your region splitting policy carefully first).
Be warned that sequential row key prefixes will make some of your regions very hot if you have a lot of concurrency, so it would be wise to buffer writes to that table. Additionally, if you get more writes than a single region can handle, you're going to need some sort of sharding prefix (i.e. a modulo of the timestamp), although this will make your retrievals a lot more complex (you'll have to merge the results of multiple scans).
2- Hash the query string in a way that always gives you a fixed-length row key, without having to care about encoding (MD5, maybe?).
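For option 2, a minimal sketch of building such a fixed-length key (shown in C# purely for illustration, since the original context is HBase; the layout here, MD5 of the query plus a zero-padded reverse timestamp, is an assumption, not part of the original answer):
using System.Security.Cryptography;
using System.Text;
// ...
static string BuildRowKey(string query, DateTime utcTimestamp)
{
    using (var md5 = MD5.Create())
    {
        string queryPart = BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(query))).Replace("-", ""); // always 32 hex chars, regardless of query length or encoding
        string timePart = (DateTime.MaxValue.Ticks - utcTimestamp.Ticks).ToString("d20"); // newest first within a given query
        return queryPart + "_" + timePart;
    }
}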

How to get many table entities from Azure Table Storage with multiple PKs?

I have a bunch of primary keys - tens of thousands - and I want to retrieve their associated table entities. All row keys are empty strings. The best way I know of doing so is querying them one by one asynchronously. It seems fast, but ideally I would like to bunch a few entities together in a single transaction. Playing with the new Storage Client, I have the following code failing:
var sample = GetSampleIds(); //10000 pks
var account = GetStorageAccount();
var tableClient = account.CreateCloudTableClient();
var table = tableClient.GetTableReference("myTable");
//I'm trying to get first and second pk in a single request.
var keyA = sample[0];
var keyB = sample[1];
var filterA = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyA);
var filterB = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyB);
//filterAB = "(PartitionKey eq 'keyA') or (PartitionKey eq 'keyB')"
var filterAB = TableQuery.CombineFilters(filterA, TableOperators.Or, filterB);
var query = new TableQuery<TweetEntity>().Where(filterAB);
//Does something weird. I thought it might be fetching a range at one point.
//Whatever it does it doesn't return. Expected the following line to get an array of 2 items.
table.ExecuteQuery(query).ToArray();
// replacing filterAB in query with either filterA or filterB works as expected
Examples always show CombineFilters working on PK and then RK, but this is of no use to me. I'm assuming that this is not possible.
Question
Is it possible to bundle entities together by PK? I know the maximum filter length is 15, but even 2 would be a potential improvement when you are fetching 10,000 items. Also, where is the manual? I can't find proper documentation anywhere; for example, the MSDN page for CombineFilters is a basic shell providing less information than IntelliSense does.
tl;dr: it sounds like you need to rethink your partitioning strategy. Unique, non-sequential IDs are not good PKs when you commonly have to query or work on many of them. More:
Partition Keys are not really meant to be 'primary' keys. They are better thought of as identifying grouped, closely related sets of data that you want to work with. You can group by id, date, etc. PKs are used to scale the system - in theory, you could have 1 partition server per PK working on your data.
To your question: you won't get very good performance doing what you are doing. In fact, OR queries are non-optimized and will require a full table scan (bad). So, instead of doing PK = "foo" OR PK = "bar", you really should be doing 2 queries (in parallel) as that will get you much better performance.
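A minimal sketch of what that could look like with the code from the question (filterA, filterB, table, and TweetEntity are reused from above; wrapping the synchronous ExecuteQuery calls in tasks is just one way to run them in parallel, and requires System.Linq and System.Threading.Tasks):
var queryA = new TableQuery<TweetEntity>().Where(filterA);
var queryB = new TableQuery<TweetEntity>().Where(filterB);
var taskA = Task.Run(() => table.ExecuteQuery(queryA).ToList());
var taskB = Task.Run(() => table.ExecuteQuery(queryB).ToList());
Task.WaitAll(taskA, taskB);
var results = taskA.Result.Concat(taskB.Result).ToList(); // same two entities, no OR scan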
Back to your core issue: if you are using a unique identifier for a particular entity and describing that as a PK, then it also means you are not able to work on more than 1 entity at a time. In order to work on multiple entities, you really need a common partition key. Can you think of a better one that describes your entities? Does date/time work? Some other common attribute? Those tend to be good partition keys. The only other thing you can do is what is called partition ranging - where your queries tend to be ranged over partition keys. An example of this is date/time partition keys: you can use file ticks to describe your partition and end up with sequential tick values as PKs. Your query can then use > and < to specify a range (no OR). Those can be more optimized, but you will still potentially get a ton of continuation tokens.
As dunnry mentioned in his reply, the problem with this approach is that OR queries are horribly slow. I got my problem to work without the storage client (at this point I'm not sure what's wrong with it; let's say it's a bug, maybe), but getting the 2 entities separately, without the OR query, turns out to be much(!) faster than getting them with the OR query.
