I tried calling CloudTable.ExecuteBatch(new TableBatchOperation { operation1, operation2 }); where each operation was a Retrieve operation. The snippet in question looked like this:
var partitionKey = "1";
var operation1 = TableOperation.Retrieve(partitionKey, "1");
var operation2 = TableOperation.Retrieve(partitionKey, "2");
var executedResult = ExecuteBatch(new TableBatchOperation{operation1, operation2});
I got an exception saying there could not be any retrieve operations in a batch execution. Is there a way to pull this off, or is asynchronous execution the best way to handle multiple partition key/row key lookups? For my use case I will have to look up at most 3 different rows by partition key and row key at the same time.
Yes, batch operations have certain restrictions and do not include GETs.
You can try range queries, as outlined in "Windows Azure table access latency Partition keys and row keys selection", if the partition key remains the same.
Otherwise, you can query in parallel.
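If the keys are known up front, the parallel approach can be as simple as firing the point queries concurrently and awaiting them together. A minimal sketch, assuming the classic WindowsAzure.Storage client, a CloudTable named table, and a hypothetical MyEntity type:

// using Microsoft.WindowsAzure.Storage.Table; using System.Linq; using System.Threading.Tasks;
var keys = new[] { ("1", "1"), ("1", "2"), ("1", "3") }; // (PartitionKey, RowKey) pairs from the question

// Issue the point lookups concurrently instead of batching them.
var tasks = keys
    .Select(k => table.ExecuteAsync(TableOperation.Retrieve<MyEntity>(k.Item1, k.Item2)))
    .ToArray();
var results = await Task.WhenAll(tasks);

// TableResult.Result is null when an entity was not found.
var entities = results
    .Select(r => r.Result as MyEntity)
    .Where(e => e != null)
    .ToList();

Each Retrieve is a point query, so running the three of them concurrently is typically about as fast as a single batch round trip would have been.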
I am fairly new to Azure cloud development.
I have a function app coded in C# that:
1. Gets a record from a storage table
2. Deletes that record
3. Updates fields on that record (including the partition key)
4. Inserts the new record into the storage table
I am experiencing data loss when an exception is thrown during the insert step.
I am wondering how, if step 4 throws an exception, I can roll back step 2. If that is not possible, how would I prevent the data loss? I'm unable to use the built-in table operations that would replace the entity, because I am changing the partition key.
I understand the hard part in all of this to be the partition key update, since I know the system was designed so that each transaction or operation works on records with the same partition key.
I have looked through the Table Service REST API and at all the table operations I thought could be helpful:
Insert Entity
Update Entity
Merge Entity
Insert or Update Entity
Insert or Replace Entity
You can't use an entity group transaction here because the partition key changes, so you'll have to look at a solution outside of Table storage.
What you could do is create the new record before deleting the old one. That way you're assured that you won't lose any data (as long as you make sure the request to create the record succeeded).
You could take it one step further by making it an asynchronous process: have a Storage queue or Service Bus queue hold a message with the information of the request, and have a function app (or anything else) handle those messages. That way the request remains retryable if any transient errors occur over a larger timespan.
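As a rough sketch of that queue-based variant, assuming the classic WindowsAzure.Storage queue client, Newtonsoft.Json, and a hypothetical MoveRequest message type (none of which are prescribed by the answer):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;
using Newtonsoft.Json;

// Hypothetical message describing a "move entity to a new partition" request.
public class MoveRequest
{
    public string OldPartitionKey { get; set; }
    public string NewPartitionKey { get; set; }
    public string RowKey { get; set; }
}

public static class MoveRequestPublisher
{
    // Enqueue the request; a queue-triggered function can then do the
    // insert-first-then-delete work and is retried on transient failures.
    public static void Enqueue(string connectionString, MoveRequest request)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var queue = account.CreateCloudQueueClient().GetQueueReference("entity-move-requests");
        queue.CreateIfNotExists();
        queue.AddMessage(new CloudQueueMessage(JsonConvert.SerializeObject(request)));
    }
}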
As described in the question, we were able to reproduce the data loss issue: once an exception occurred on the Insert, the record that had already been deleted was gone.
Cosmos DB does not allow updating the value of PartitionKey directly; you need to delete the record and create a new record with the new PartitionKey value.
To prevent data loss with the built-in TableOperations, only call Execute() for the delete once the prior steps have completed successfully.
TableOperation delOperation = TableOperation.Delete(getBatchCustomer);
You can clone or create a deep copy of the first object using a copy constructor:
public Customer(Customer customer)
{
    PartitionKey = customer.PartitionKey;
    RowKey = customer.RowKey;
    customerName = customer.customerName;
}
Create a copy of the object with the new PartitionKey value:
Customer c = new(getCustomer)
{
    PartitionKey = "India"
};
Once step 4 mentioned in the question completes successfully, we can commit the delete operation.
We got an exception on the insert step, but when we looked at the table, no data had been lost.
Below is a code snippet to prevent data loss.
TableOperation _insOperation = TableOperation.Insert(c);
var insResult = _table.Execute(_insOperation);

// Only delete the original entity once the insert has succeeded.
if (insResult.HttpStatusCode == 204)
{
    var delResult = _table.Execute(delOperation);
}
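Putting the snippets together, the overall flow looks roughly like the sketch below (the Customer type and _table reference are the ones used above; the "USA"/"CUST-001" keys are only illustrative):

// Retrieve the original entity (illustrative keys).
var retrieve = TableOperation.Retrieve<Customer>("USA", "CUST-001");
var getCustomer = (Customer)_table.Execute(retrieve).Result;

// Clone it with the new PartitionKey using the copy constructor above.
var c = new Customer(getCustomer) { PartitionKey = "India" };

// Insert the copy first; only delete the original once the insert succeeded.
var insResult = _table.Execute(TableOperation.Insert(c));
if (insResult.HttpStatusCode >= 200 && insResult.HttpStatusCode < 300)
{
    _table.Execute(TableOperation.Delete(getCustomer));
}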
We are using the .NET Azure Storage client library to retrieve data from the server. When we try to retrieve data, the result has 0 items and a continuation token; when we fetch the next page with this continuation token, we get the same result again. However, when we use the 4th continuation token fetched this way, we get the proper result with 15 items (the requested item count for all requests is 15). This issue is observed only when we apply filter conditions. The code used to fetch the results is given below:
var tableReference = _tableClient.GetTableReference(tableName);
var query = new TableQuery();
query.Where("'DeviceId' eq '99'"); // DeviceId is of type Int32
query.TakeCount = 15;
var resultsQuery = tableReference.ExecuteQuerySegmented(query, token);
var nextToken = resultsQuery.ContinuationToken;
var results = resultsQuery.ToList();
This is expected behavior. From Query Timeout and Pagination:
A query against the Table service may return a maximum of 1,000 items at one time and may execute for a maximum of five seconds. If the result set contains more than 1,000 items, if the query did not complete within five seconds, or if the query crosses the partition boundary, the response includes headers which provide the developer with continuation tokens to use in order to resume the query at the next item in the result set. Continuation token headers may be returned for a Query Tables operation or a Query Entities operation.
I noticed that you're not using PartitionKey in your query. This results in a full table scan. The recommendation would be to always use PartitionKey (and possibly RowKey) in your queries to avoid full table scans. I would highly recommend reading Azure Storage Table Design Guide: Designing Scalable and Performant Tables to get the most out of Azure Tables.
UPDATE: Explaining "If the query crosses the partition boundary"
Let me try with an example as to what I understand by partition boundary. Let's assume you have 1 million rows in your table evenly spread across 10 partitions (let's assume your PartitionKeys are 001, 002, 003, ... 010). Now we know that the data in Azure Tables is organized by PartitionKey and then, within a partition, by RowKey. Since your query did not specify a PartitionKey, the Table service starts with the 1st partition (i.e. PartitionKey == 001) and tries to find the matching data there. If it does not find any data in that partition, it does not know whether the data is in another partition, so instead of moving on to the next partition it simply returns with a continuation token and leaves it to the client consuming the API to decide whether to continue the search using the same parameters plus the continuation token, or to revise the search and start again.
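In practice the fix is to keep following the continuation token until it comes back null (and, per the recommendation above, to include PartitionKey in the filter whenever possible). A sketch based on the question's query, using the typed filter helper since DeviceId is an Int32:

// using Microsoft.WindowsAzure.Storage.Table; using System.Collections.Generic;
var query = new TableQuery().Where(
    TableQuery.GenerateFilterConditionForInt("DeviceId", QueryComparisons.Equal, 99));
query.TakeCount = 15;

var results = new List<DynamicTableEntity>();
TableContinuationToken token = null;
do
{
    // Empty segments that still carry a token are normal; just keep going.
    var segment = tableReference.ExecuteQuerySegmented(query, token);
    results.AddRange(segment.Results);
    token = segment.ContinuationToken;
} while (token != null && results.Count < 15);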
I have a situation where I need to create a SAS token based on a range of PartitionKeys and RowKeys both.
To be more precise, my PK is based on the ticks of a timestamp (there is a partition for every 10-minute range). My RK is based on some string.
I'm trying to call storage from a browser and get data for a range of PKs (based on some time range) and, within those PKs, for a range of some RKs, i.e.:
PK > 100000000 && PK < 200000000 && RK > "aaa" && RK < "mmm"
When I create the token, the response from storage returns the correct partitions, but entities for all RKs.
var sas = table.GetSharedAccessSignature(new SharedAccessTablePolicy
{
    Permissions = SharedAccessTablePermissions.Query,
    SharedAccessExpiryTime = DateTime.UtcNow.Add(period)
}, null, startPk, startRk, endPk, endRk);
Any ideas how to make the call honor the provided RK range, without me having to filter out the unnecessary entities on the client?
@GauravMantri pointed me to a helpful article: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/06/12/introducing-table-sas-shared-access-signature-queue-sas-and-update-to-blob-sas.aspx
What I was trying to do is not supported. The PK/RK ranges define a continuous range from the start PK/RK to the end PK/RK, rather than acting as a filter query as I had thought.
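One partial workaround (not a SAS feature, just an ordinary server-side filter) is to keep the SAS scoped to the PK range and put the RowKey bounds into the query's $filter; the service still evaluates the RowKey condition per partition rather than as a range seek, but the client no longer has to discard entities itself. A sketch using the values from the question:

var pkRange = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThan, "100000000"),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, "200000000"));

var rkRange = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThan, "aaa"),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, "mmm"));

var query = new TableQuery().Where(TableQuery.CombineFilters(pkRange, TableOperators.And, rkRange));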
Note that your partition key pattern may result in bad performance because of the "Append Only" pattern that you are setting up: https://azure.microsoft.com/en-us/documentation/articles/storage-performance-checklist/#subheading28
Azure Storage learns your usage pattern and adjusts the partition distribution adaptively according to the load. So if your load is spread across a number of partition keys, it can split those partitions onto different servers internally to balance the load. However, if your load all lands on one partition, and that partition changes periodically (as it does with the append-only pattern), then the adaptive load-balancing logic becomes ineffective. To avoid this, you should avoid using dates or date-times as your partition key if your query patterns allow it.
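One common way to follow that advice, sketched purely as an illustration (the bucket count, the entityId value and the key format are not from the answer), is to prepend a small hash-derived bucket to the time-based key so that writes spread across several partitions; reads then fan out one query per bucket:

// Spread writes across N partitions instead of a single "current" one.
const int BucketCount = 16;
int bucket = (int)((uint)entityId.GetHashCode() % (uint)BucketCount); // entityId: any stable identifier on the entity
string partitionKey = $"{bucket:D2}_{DateTime.UtcNow.Ticks:D19}";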
I am trying to determine a good strategy for storing logging information in Azure Table Storage. I have the following:
PartitionKey: The name of the log.
RowKey: Inverted DateTime ticks.
The only issue here is that partitions could get very large (millions of entities) and the size will increase with time.
But that being said, the type of queries being performed will always include the PartitionKey (no scanning) AND a RowKey filter (a minor scan).
For example (in a natural language):
where `PartitionKey` = "MyApiLogs" and
where `RowKey` is between "01-01-15 12:00" and "01-01-15 13:00"
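Rendered with the storage client's filter helpers, that query might look roughly like the sketch below (assuming the RowKey was produced as DateTime.MaxValue.Ticks minus the log time's ticks, zero-padded to 19 digits):

// Inverted ticks sort newest-first, so the later wall-clock time maps to the smaller RowKey.
string Inverted(DateTime dt) => (DateTime.MaxValue.Ticks - dt.Ticks).ToString("D19");

var upper = Inverted(new DateTime(2015, 1, 1, 12, 0, 0, DateTimeKind.Utc)); // 12:00 -> larger RowKey
var lower = Inverted(new DateTime(2015, 1, 1, 13, 0, 0, DateTimeKind.Utc)); // 13:00 -> smaller RowKey

var filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "MyApiLogs"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, lower),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, upper)));

var query = new TableQuery().Where(filter);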
Provided that the query is done on both PartitionKey and RowKey, I understand that the size of the partition doesn't matter.
Take a look at our new Table Design Patterns Guide - specifically the log-data anti-pattern, as it talks about this scenario and alternatives. Often when people write log files they use a date for the PK, which results in a hot partition because all writes go to a single partition. Quite often blobs end up being a better destination for log data, as people typically end up processing the logs in batches anyway; the guide talks about this as an option.
Adding my own answer so people can have something inline without needing external links.
You want the partition key to be the timestamp plus the hash code of the message. This is good enough in most cases. You can add to the hash code of the message the hash code(s) of any additional key/value pairs as well if you want, but I've found it's not really necessary.
Example:
string partitionKey = DateTime.UtcNow.ToString("o").Trim('Z', '0') + "_" + ((uint)message.GetHashCode()).ToString("X");
string rowKey = logLevel.ToString();
DynamicTableEntity entity = new DynamicTableEntity { PartitionKey = partitionKey, RowKey = rowKey };
// add any additional key/value pairs from the log call to the entity, i.e. entity["key"] = value;
// use InsertOrMerge to add the entity
When querying logs, you can use a query with partition key that is the start of when you want to retrieve logs, usually something like 1 minute or 1 hour from the current date/time. You can then page backwards another minute or hour with a different date/time stamp. This avoids the weird date/time hack that suggests subtracting the date/time stamp from DateTime.MaxValue.
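A sketch of that windowed lookup, assuming the partition keys keep the ISO-8601 prefix produced above (the Trim('Z', '0') makes the boundaries slightly approximate) and that table is the CloudTable holding the log entities:

// Build the same kind of prefix the PartitionKey above starts with.
string Prefix(DateTime dt) => dt.ToString("o").Trim('Z', '0');

// Pull the last hour of logs; page backwards by shifting the window.
var filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, Prefix(DateTime.UtcNow.AddHours(-1))),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, Prefix(DateTime.UtcNow)));

var entities = table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)).ToList();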
If you get extra fancy and put a search service on top of the Azure Table storage, then you can look up key/value pairs quickly.
This will be much cheaper than Application Insights if you are using Azure Functions (I would suggest disabling Application Insights in that case). If you need multiple log names, just add another table.
I have a bunch of primary keys - tens of thousands - and I want to retrieve their associated table entities. All row keys are empty strings. The best way I know of doing so is querying them one by one asynchronously. It seems fast, but ideally I would like to bunch a few entities together in a single transaction. Playing with the new Storage Client, I have the following code failing:
var sample = GetSampleIds(); //10000 pks
var account = GetStorageAccount();
var tableClient = account.CreateCloudTableClient();
var table = tableClient.GetTableReference("myTable");
//I'm trying to get first and second pk in a single request.
var keyA = sample[0];
var keyB = sample[1];
var filterA = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyA);
var filterB = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyB);
//filterAB = "(PartitionKey eq 'keyA') or (PartitionKey eq 'keyB')"
var filterAB = TableQuery.CombineFilters(filterA, TableOperators.Or, filterB);
var query = new TableQuery<TweetEntity>().Where(filterAB);
// Does something weird. I thought it might be fetching a range at one point.
// Whatever it does, it doesn't return. Expected the following line to get an array of 2 items.
var results = table.ExecuteQuery(query).ToArray();
// replacing filterAB in query with either filterA or filterB works as expected
Examples always show CombineFilters working on PK and then RK, but this is of no use to me. I'm assuming that this is not possible.
Question
Is it possible to bundle together entities by PK? I know a filter is limited to 15 comparisons, but even 2 is a potential improvement when you are fetching 10,000 items. Also, where is the manual? I can't find proper documentation anywhere. For example, the MSDN page for CombineFilters is a basic shell with less information than IntelliSense provides.
tl;dr: sounds like you need to rethink your partitioning strategy. Unique, non-sequential IDs are not good PKs when you commonly have to query or work on many. More:
Partition keys are not really meant to be 'primary' keys. They are better thought of as grouping closely related sets of data that you want to work with. You can group by id, date, etc. PKs are used to scale the system - in theory, you could have one partition server per PK working on your data.
To your question: you won't get very good performance doing what you are doing. In fact, OR queries are not optimized and will require a full table scan (bad). So, instead of doing PK = "foo" OR PK = "bar", you really should be doing 2 queries (in parallel), as that will get you much better performance.
Back to your core issue: if you are using a unique identifier for a particular entity and describing that as a PK, then it also means you are not able to work on more than one entity at a time. In order to work on entit(ies) you really need a common partition key. Can you think of a better one that describes your entities? Does date/time work? Some other common attribute? Those tend to be good partition keys. The only other thing you can do is what is called partition ranging - where your queries tend to be ranged on partition keys. An example of this is date/time partition keys: you can use time ticks to describe your partition and end up with sequential ticks as PKs. Your query can then use > and < to specify a range (no OR). Those can be more optimized, but you will still potentially get a ton of continuation tokens.
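For the two keys from the question, the parallel version described above might look roughly like this sketch (same table and TweetEntity as in the question; each partition is drained with its own segmented query):

async Task<List<TweetEntity>> QueryPartitionAsync(CloudTable table, string partitionKey)
{
    var query = new TableQuery<TweetEntity>().Where(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey));

    var results = new List<TweetEntity>();
    TableContinuationToken token = null;
    do
    {
        var segment = await table.ExecuteQuerySegmentedAsync(query, token);
        results.AddRange(segment.Results);
        token = segment.ContinuationToken;
    } while (token != null);
    return results;
}

// Run one query per partition key in parallel instead of a single OR query.
var batches = await Task.WhenAll(QueryPartitionAsync(table, keyA), QueryPartitionAsync(table, keyB));
var entities = batches.SelectMany(b => b).ToArray();

Since the question notes that every RowKey is an empty string, parallel TableOperation.Retrieve(pk, string.Empty) point lookups would be cheaper still.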
As dunnry has mentioned in his reply, the problem with this approach is that OR queries are horribly slow. I got my problem to work without the storage client (at this point I'm not sure what's wrong with it; maybe it's a bug), but getting the 2 entities separately, without the OR query, turns out to be much(!) faster than getting them with the OR query.