Azure Table query with only RowKey

I have a table with ~10,000 PartitionKeys (PK), and each PK has ~500,000 RowKeys (RK) in the format "yyyyMMddHHmmss".
When I try to retrieve records with a "yyyyMMdd"-formatted RK filter without a PK, it takes forever (literally) to get results.
My queries are mostly PK + RK; unfortunately, some queries must be filtered by RK only.
I understand that retrieving data without a PK is not the best approach, but I have to.
And it looks like this is not an option at all in a real-life scenario.
The only way I can think of is keeping another table that maps RKs back to PKs, but I really don't want to maintain a reference table unless it is absolutely the only way to handle this.
CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
CloudTableClient tableClient = account.CreateCloudTableClient();
CloudTable table = tableClient.GetTableReference("test");
table.CreateIfNotExists();
var query = new TableQuery<TestEntity>().Where("(RowKey ge '20050103') and (RowKey lt '20050104')");
var result = table.ExecuteQuery(query);
Debug.WriteLine(result.Count());

As you have correctly noted, for such a volume of data, retrieval without a PK is not feasible.
If retrieval by date only is necessary, then, as painful as it sounds, the design of your table schema is flawed and needs to be redesigned. If you wish to explain all of the usage scenarios for the data in the table, perhaps folks here can help with a proper design?

Related

CosmosDb table retrieve multiple records at once

I am looking at doing multiple point-read operations at a time (they do belong to the same partition, and I do not intend to retrieve more than 100 at a time). The requirements are really quite simple, but CloudTable does not support retrieving more than one entity at a time, and I am really at my wits' end about how to proceed.
I could create a table query with the partition key and all the row keys of interest, but that really seems like overkill: I know exactly what I am looking for, and I also do not want to end up scanning the entire partition.
This is what I have done; however, I do not know if the CloudTable client is thread safe.
List<Task<TableResult>> taskList = new List<Task<TableResult>>();
CloudTable cloudTable = ...;
foreach (T entity in readContainer.Entities)
{
    taskList.Add(cloudTable.ExecuteAsync(
        TableOperation.Retrieve<T>(entity.PartitionKey, entity.RowKey)));
}
Task.WaitAll(taskList.ToArray());
IList<TableResult> results = new List<TableResult>();
foreach (Task<TableResult> task in taskList)
{
    results.Add(task.Result);
}
CloudTable is thread safe, so issuing the retrieves in parallel as above is fine.
The CloudTable batch API can be leveraged for grouped writes, but note that a batch may contain at most one Retrieve operation, so it does not help with multiple point reads. See https://learn.microsoft.com/en-us/dotnet/api/microsoft.windowsazure.storage.table.cloudtable.executebatch?view=azure-dotnet
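Since CloudTable is thread safe, the blocking Task.WaitAll pattern in the question can also be written without blocking; a minimal sketch, assuming the same cloudTable, T, and readContainer from the question:
```csharp
// Hedged sketch: awaitable variant of the parallel point reads above.
var tasks = readContainer.Entities
    .Select(e => cloudTable.ExecuteAsync(
        TableOperation.Retrieve<T>(e.PartitionKey, e.RowKey)))
    .ToList();
TableResult[] results = await Task.WhenAll(tasks);
```
Task.WhenAll preserves the order of the input tasks, so results line up with readContainer.Entities.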

What is the best way of creating Partition key in azure table storage for storing sensor output data?

I searched for best practices for storing sensor output data in Azure Table Storage but didn't find a good answer. I am currently working on a project that stores sensor data in Azure Table Storage. I am currently using the sensor ID as the partition key, and I store the sensor outputs every second. About 100 sensors are currently in use, so imagine how much data is stored every day. I am getting slow performance in my web application when I search a particular sensor's data by date. Is there a better way to improve the performance of the web app? How about using the date instead of the sensor ID as the partition key? Code is not important here; I need a logical solution. Maybe this question will help a lot of developers who are working on such a scenario.
UPDATE
Each sensor provides 10 different outputs plus the output datetime; they are stored in the same row for each sensor ID. I retrieve sensor data by date range and sensor ID.
PartitionKey = sensor id, RowKey = datetime, plus 10 output columns and the output date.
here is my code
var query = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sensorID);
var dateFilter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterConditionForDate("outputdate", QueryComparisons.GreaterThanOrEqual, Convert.ToDateTime(from)),
    TableOperators.And,
    TableQuery.GenerateFilterConditionForDate("outputdate", QueryComparisons.LessThanOrEqual, Convert.ToDateTime(to)));
query = TableQuery.CombineFilters(query, TableOperators.And, dateFilter);
var rangeQuery = new TableQuery<TotalizerTableEntity>().Where(query);
var entities = table.ExecuteQuery(rangeQuery).OrderBy(j => j.date).ToList();
outputdate indicates the time the output was generated; it is stored as a datetime. All outputs in a row have the same output time.
First, I would highly recommend that you read Azure Storage Table Design Guide: Designing Scalable and Performant Tables. This will give you a lot of ideas about how to structure your data.
Now coming to your current implementation. What I am noticing is that you're including the PartitionKey in your query (which is very good, BTW) but then also filtering on a non-indexed attribute (outputdate). This results in what is known as a Partition Scan. For larger tables this creates a problem, because your query will scan the entire partition for matching outputdate values.
You mentioned that you're storing a datetime value as the RowKey. Assuming the RowKey value matches the value of outputdate, I would recommend using RowKey in your query instead of the non-indexed attribute. RowKey and PartitionKey are the only two attributes that are indexed in a table, so the query will be comparatively much faster.
When saving date/time as the RowKey, I would recommend converting it into ticks (DateTime.Ticks) rather than simply converting the date/time value to a string. If you go with this approach, zero-pad the ticks so that all values are the same length (e.g. DateTime.Ticks.ToString("d19")).
You can also save the RowKey as reverse ticks, i.e. (DateTime.MaxValue.Ticks - DateTime.Ticks).ToString("d20"). This ensures that the latest entries are added at the top of the table instead of at the bottom, which helps in scenarios where you are more interested in querying the latest records.
If you will always query for a particular sensor, it may not hurt to save the data for each sensor in a separate table, i.e. each sensor gets its own table. This frees up one key for you: you can use the date/time value (currently the RowKey) as the PartitionKey and some other value as the RowKey. Furthermore, it allows you to scale across storage accounts - data for some sensors can go in one storage account while data for other sensors goes in another. You just need to save this relationship somewhere so that the data reaches the correct storage account/table.
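Putting the RowKey advice together, here is a sketch of a RowKey range query over the indexed keys instead of the non-indexed outputdate attribute (the helper names are assumptions, and from/to are assumed to be DateTime values):
```csharp
// Hedged sketch: tick-based RowKeys and an indexed RowKey range query.
string RowKeyFor(DateTime dt) => dt.Ticks.ToString("d19");                 // ascending order
string ReverseRowKeyFor(DateTime dt) =>
    (DateTime.MaxValue.Ticks - dt.Ticks).ToString("d20");                  // newest first

// Rows for one sensor between "from" and "to", using only indexed attributes:
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sensorID),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, RowKeyFor(from)),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, RowKeyFor(to))));
var entities = table.ExecuteQuery(new TableQuery<TotalizerTableEntity>().Where(filter)).ToList();
```
Because both PartitionKey and RowKey are indexed, this turns the partition scan into a range scan within the partition.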

Regarding Azure table design

I am working as a freelancer, and right now I am working on one of my games, trying to use the Azure Table service to log my users' moves in Azure tables.
The game is based on cards.
The flow is like this:
Many users (UserId) will be playing at a table (TableId). Each game at the table will have a unique GameId, and each game can contain multiple deals with a unique DealId.
There can be multiple deals at the same table with the same GameId. Also, each user will have the same DealId within a single game.
The winner is decided after each player has had multiple chances.
Problem:
I can make TableId the PartitionKey, but I am not sure what to choose for the RowKey, because the combination of TableId and RowKey (GameId/UserId/DealId) must be unique in the table.
I can have entries like:
TableId GameId DealId UserId timestamp
1 201 300 12345
1 201 300 12567
Maybe what I can do is create 4 Azure tables as below, but that means a lot of duplication; I would also not be able to fire a point query as described at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
TableId, GameId, DealId, and UserId are all of type long.
I would like to query data such that
Get me all the logs from a TableId.
Get me all the logs from a TableId and in a particular game(GameId)
Get me all the logs of a user(userid) in this game(GameId)
Get me all the logs of a user in a deal(dealId)
Get me all the logs from a table on a date; similarly for a user,game and deal
Based on my knowledge of Azure Tables so far, I believe you're on the right track.
However there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need separate tables for each kind of data, though that approach logically separates the data nicely. If you want, you could store everything in a single table. If you go with a single table, since these ids (Game, Table, User, and Deal) are all numbers, I would recommend prefixing each value appropriately so that you can identify its kind. For example, when a PartitionKey denotes a Game Id, prefix the value with G| so that you know it's a Game Id, e.g. G|101.
Pre-pad your Id values with 0 to make them equal-length strings
You mentioned that your id values are long. However, the PartitionKey value is of type string, so I would recommend pre-padding the values so that they are of equal length. For example, when storing a Game Id as the PartitionKey, instead of storing 1, 2, 103, etc., store 00000000001, 00000000002, 00000000103. This way, when you list the Ids, they will be sorted in proper order. Without pre-padding, string sorting would give you 1, 10, 11, 12, ..., 19, 2, 20, and so on.
You will lose transaction support
Since you're using multiple tables (or even a single table with different PartitionKeys), you will not be able to use Entity Group Transactions, which Azure Tables only supports within a single partition, and all the inserts need to be done as individual operations. Since each operation is a network call and can possibly fail, you may want to do the inserts through an idempotent background process that keeps retrying until the data is in all the tables.
Instead of Guid for RowKey, I suggest you create a composite RowKey based on other values
This is more applicable for update scenario. Since an update requires both PartitionKey and RowKey, I would recommend using a RowKey which is created as a composition of other values. For example, if you're using TableId as PartitionKey for GameLogsByTableId, I would suggest creating a RowKey using other values e.g. U|[UserId]|D|[DealId]|G|[GameId]. This way, when you get a record to update, you automatically know how to create a RowKey instead of fetching the data first from the table.
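A sketch of building such a composite RowKey from the zero-padded ids (the helper name is an assumption):
```csharp
// Hedged sketch: composite RowKey for GameLogsByTableId, built from padded ids
// so that the RowKey can be reconstructed at update time without a lookup.
static string BuildRowKey(long userId, long dealId, long gameId) =>
    $"U|{userId:d19}|D|{dealId:d19}|G|{gameId:d19}";

// BuildRowKey(12345, 300, 201)
// → "U|0000000000000012345|D|0000000000000000300|G|0000000000000000201"
```
The d19 width covers the full range of long, so keys of the same shape always sort consistently.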
Partition Scans
I looked at your querying requirements, and almost all of them would result in Partition Scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements: as things stand, you would need to scan a user's entire partition to find records for a given Game Id or Deal Id. So please be prepared for the scenario where the table service returns you nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use table storage for this. It will make your job much harder than using an SQL database; you can use any index you like, have relational integrity, and so much more. The only thing in favour of ATS is that it's cheap for large data.

Azure Table Storage Batch Row Key Lookups

I tried using CloudTable::ExecuteBatch(new TableBatchOperation{operation1, operation2});
Each operation was a Retrieve operation. The snippet in question looked like this:
var partitionKey = "1";
var operation1 = TableOperation.Retrieve(partitionKey, "1");
var operation2 = TableOperation.Retrieve(partitionKey, "2");
var executedResult = cloudTable.ExecuteBatch(new TableBatchOperation { operation1, operation2 });
I got an exception saying there could not be any retrieve operations in a batch execution. Is there a way to pull this off or is an asynchronous execution the best way to handle multiple partition key, row key look ups? For my use case I will have to look up at most 3 different rows by partition key and row key at the same time.
Yes, batch operations have certain restrictions and cannot include GETs (retrieves).
You can try range queries as outlined here, if the partition key remains the same.
Windows Azure table access latency Partition keys and row keys selection
Otherwise, you can query in parallel.
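The range-query alternative can be sketched like this for the two row keys from the question (the entity type is an assumption):
```csharp
// Hedged sketch: one range query over RowKeys "1".."2" in partition "1",
// instead of a (disallowed) batch of Retrieve operations.
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "1"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, "1"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, "2")));
var results = cloudTable.ExecuteQuery(
    new TableQuery<DynamicTableEntity>().Where(filter)).ToList();
```
For only 3 lookups, parallel ExecuteAsync retrieves are just as reasonable; the range query wins when the row keys are contiguous.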

How to get many table entities from Azure Table Storage with multiple PKs?

I have a bunch of partition keys - tens of thousands - and I want to retrieve their associated table entities. All row keys are empty strings. The best way I know of doing this is querying them one by one asynchronously. It seems fast, but ideally I would like to bundle a few entities together in a single request. Playing with the new Storage Client, I have the following code failing:
var sample = GetSampleIds(); //10000 pks
var account = GetStorageAccount();
var tableClient = account.CreateCloudTableClient();
var table = tableClient.GetTableReference("myTable");
//I'm trying to get first and second pk in a single request.
var keyA = sample[0];
var keyB = sample[1];
var filterA = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyA);
var filterB = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyB);
//filterAB = "(PartitionKey eq 'keyA') or (PartitionKey eq 'keyB')"
var filterAB = TableQuery.CombineFilters(filterA, TableOperators.Or, filterB);
var query = new TableQuery<TweetEntity>().Where(filterAB);
//Does something weird. I thought it might be fetching a range at one point.
//Whatever it does it doesn't return. Expected the following line to get an array of 2 items.
table.ExecuteQuery(query).ToArray();
// replacing filterAB in query with either filterA or filterB works as expected
Examples always show CombineFilters working on PK and then RK, but this is of no use to me. I'm assuming that this is not possible.
Question
Is it possible to bundle entities together by PK? I know the maximum filter length is 15, but even 2 would be a potential improvement when you are fetching 10,000 items. Also, where is the manual? I can't find proper documentation anywhere; for example, the MSDN page for CombineFilters is a basic shell wrapping less information than IntelliSense provides.
tl;dr: it sounds like you need to rethink your partitioning strategy. Unique, non-sequential IDs are not good PKs when you commonly have to query or work on many of them. More:
Partition Keys are not really meant to be 'primary' keys. Think of them instead as grouped, closely related sets of data that you want to work with together. You can group by id, date, etc. PKs are what the system uses to scale - in theory, you could have one partition server per PK working on your data.
To your question: you won't get very good performance doing what you are doing. In fact, OR queries are non-optimized and will require a full table scan (bad). So, instead of doing PK = "foo" OR PK = "bar", you really should issue 2 queries (in parallel), as that will get you much better performance.
Back to your core issue: if you are using a unique identifier per entity as the PK, it also means you can never work on more than one entity at a time. In order to work on multiple entities, you really need a common partition key. Can you think of a better one that describes your entities? Does date/time work? Some other common attribute? Those tend to be good partition keys. The only other option is what is called partition ranging - where your queries tend to be ranged over partition keys. An example of this is date/time partition keys: you can use DateTime ticks as sequential PKs, and your query can then use > and < to specify a range (no OR). Those can be more optimized, but you may still get a ton of continuation tokens.
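The two-parallel-queries alternative can be sketched with the filterA/filterB from the question:
```csharp
// Hedged sketch: run the two single-PK queries in parallel instead of an OR filter.
var taskA = Task.Run(() => table.ExecuteQuery(new TableQuery<TweetEntity>().Where(filterA)).ToArray());
var taskB = Task.Run(() => table.ExecuteQuery(new TableQuery<TweetEntity>().Where(filterB)).ToArray());
await Task.WhenAll(taskA, taskB);
var combined = taskA.Result.Concat(taskB.Result).ToArray();
```
Each query hits exactly one partition, so each can be served by its partition server without a table scan.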
As dunnry mentioned in his reply, the problem with this approach is that OR queries are horribly slow. I got my query to work without the storage client (at this point, I'm not sure what's wrong with it; let's say it's a bug, maybe), but getting the 2 entities separately without the OR query turns out to be much(!) faster than getting them with the OR query.