I am looking at doing multiple point read operations at a time (they do belong to the same partition, and I do not intend to retrieve more than 100 at a time). The requirements are really quite simple. CloudTable does not support retrieving more than one entity at a time, and I am really at my wit's end about how to proceed.
I could create a table query with the partition key and all the row keys that are of interest, but that really seems like overkill. I know exactly what I am looking for. I also do not want to end up scanning the entire partition.
This is what I have done; however, I do not know whether the CloudTable client is thread safe.
// Issue one point read per entity and run them in parallel.
List<Task<TableResult>> taskList = new List<Task<TableResult>>();
CloudTable cloudTable = ...;
foreach (T entity in readContainer.Entities)
{
    taskList.Add(cloudTable.ExecuteAsync(
        TableOperation.Retrieve<T>(entity.PartitionKey, entity.RowKey)));
}

// Block until every retrieve has completed, then collect the results.
Task.WaitAll(taskList.ToArray());
IList<TableResult> results = new List<TableResult>();
foreach (Task<TableResult> task in taskList)
{
    results.Add(task.Result);
}
CloudTable is thread safe.
The CloudTable batch API can also be leveraged in scenarios like this. Please check https://learn.microsoft.com/en-us/dotnet/api/microsoft.windowsazure.storage.table.cloudtable.executebatch?view=azure-dotnet
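As a side note, the fan-out in the question can also be awaited rather than blocked on with Task.WaitAll, which keeps the calling thread free. A minimal sketch, assuming the same classic Microsoft.WindowsAzure.Storage.Table types; RetrieveManyAsync is just an illustrative helper name:
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class TableReads
{
    // Illustrative helper: issue one point read per (PartitionKey, RowKey) pair
    // and await them all instead of blocking with Task.WaitAll.
    public static async Task<IList<TableResult>> RetrieveManyAsync<T>(
        CloudTable table, IEnumerable<T> entities) where T : ITableEntity, new()
    {
        List<Task<TableResult>> tasks = entities
            .Select(e => table.ExecuteAsync(TableOperation.Retrieve<T>(e.PartitionKey, e.RowKey)))
            .ToList();

        // Task.WhenAll preserves the order of the input tasks in its result array.
        TableResult[] results = await Task.WhenAll(tasks);
        return results;
    }
}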
I am fairly new to Azure cloud development.
I have a function app coded in C# that:
Gets a record from a storage table
Deletes that record
Updates fields on that record (including the partition key)
Inserts the new record into the storage table
I am experiencing data loss when an exception is thrown on the insert step.
I am wondering how, if step 4 throws an exception, I can then roll back step 2. If that is not possible, how would I prevent the data loss, given that I'm unable to use the built-in Table Operations that would replace the entity, because I am changing the partition key?
I understand the hard part in all of this to be the partition key update, since the system was designed so that each transaction or operation works on records with the same partition key.
I have looked through the Table Service REST API and at all the Table Operations I thought could be helpful:
Insert Entity
Update Entity
Merge Entity
Insert or Update Entity
Insert or Replace Entity
You can't use a transaction here because the partition key changes, so you'll have to look at a solution outside of table storage.
What you could do is create the new record before deleting the old one. That way you're assured that you won't lose any data (as long as you make sure the request to create the record succeeded).
You could take it one step further by making it an async process: have a storage queue or service bus queue up a message containing the information of the request, and have a function app (or anything else) handle the requests. That way you can ensure the request remains retryable over a larger timespan in case any transient errors occur.
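A minimal sketch of that queued variant, assuming the classic Microsoft.WindowsAzure.Storage.Queue types; the MoveEntityRequest message, the queue name, and Newtonsoft.Json for serialization are illustrative assumptions:
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;
using Newtonsoft.Json;

// Hypothetical message describing a "move entity to a new partition" request.
public class MoveEntityRequest
{
    public string OldPartitionKey { get; set; }
    public string NewPartitionKey { get; set; }
    public string RowKey { get; set; }
}

public static class EntityMoveQueue
{
    public static void EnqueueMove(CloudStorageAccount account, MoveEntityRequest request)
    {
        CloudQueueClient queueClient = account.CreateCloudQueueClient();
        CloudQueue queue = queueClient.GetQueueReference("entity-moves");
        queue.CreateIfNotExists();

        // A queue-triggered function can process this message and retry safely,
        // because the message only disappears after a successful run.
        queue.AddMessage(new CloudQueueMessage(JsonConvert.SerializeObject(request)));
    }
}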
As per the question, we were able to reproduce the data loss issue.
The table starts out containing an existing record.
Once an exception occurred on the insert, the data was lost, as mentioned in the question.
Cosmos DB doesn't allow a direct update to the value of the PartitionKey; we first need to delete the record and then create a new record with the new PartitionKey value.
To prevent data loss using the built-in TableOperations, you can call Execute() on the delete only once the prior steps have completed successfully.
// Prepare the delete of the original record, but do not execute it yet.
TableOperation delOperation = TableOperation.Delete(getBatchCustomer);
You can clone or create a deep copy of the first object using a copy constructor.
public Customer(Customer customer)
{
    PartitionKey = customer.PartitionKey;
    RowKey = customer.RowKey;
    customerName = customer.customerName;
}
Creating a copy of the object with the new PartitionKey:
Customer c = new(getCustomer)
{
    PartitionKey = "India"
};
Once step 4 mentioned in the question has completed successfully, we can commit the delete operation.
We again got an exception on the insert step, but when we looked at the table, no data had been lost.
Below is a code snippet to prevent data loss.
// Insert the copy with the new PartitionKey first.
TableOperation _insOperation = TableOperation.Insert(c);
var insResult = _table.Execute(_insOperation);

// Only delete the original record once the insert has succeeded (204 No Content).
if (insResult.HttpStatusCode == 204)
{
    var delResult = _table.Execute(delOperation);
}
What are some ways to optimize the retrieval of large numbers of entities (~250K) from a single partition in Azure Table Storage into a .NET application?
As far as I know, there are two ways to optimize the retrieval of large numbers of entities from a single partition in Azure Table Storage into a .NET application.
1. If you don't need all of the entity's properties, I suggest using server-side projection.
A single entity can have up to 255 properties and be up to 1 MB in size. When you query the table and retrieve entities, you may not need all the properties and can avoid transferring data unnecessarily (to help reduce latency and cost). You can use server-side projection to transfer just the properties you need.
From: Azure Storage Table Design Guide: Designing Scalable and Performant Tables (Server-side projection)
For more details, you can refer to the following code:
// Project only the Email property instead of transferring every property of each entity.
string filter = TableQuery.GenerateFilterCondition(
    "PartitionKey", QueryComparisons.Equal, "Sales");
List<string> columns = new List<string>() { "Email" };
TableQuery<EmployeeEntity> employeeQuery =
    new TableQuery<EmployeeEntity>().Where(filter).Select(columns);
var entities = employeeTable.ExecuteQuery(employeeQuery);
foreach (var e in entities)
{
    Console.WriteLine("RowKey: {0}, EmployeeEmail: {1}", e.RowKey, e.Email);
}
2. If you just want to display the table's data, you don't need to get all the entities at the same time.
You can get part of the result.
When you want the next part of the result, you can use the continuation token.
This will improve table query performance.
A query against the table service may return a maximum of 1,000 entities at one time and may execute for a maximum of five seconds. If the result set contains more than 1,000 entities, if the query did not complete within five seconds, or if the query crosses the partition boundary, the Table service returns a continuation token to enable the client application to request the next set of entities. For more information about how continuation tokens work, see Query Timeout and Pagination.
From: Azure Storage Table Design Guide: Designing Scalable and Performant Tables (Retrieving large numbers of entities from a query)
By using continuation tokens explicitly, you can control when your application retrieves the next segment of data.
For more details, you can refer to the following code:
string filter = TableQuery.GenerateFilterCondition(
    "PartitionKey", QueryComparisons.Equal, "Sales");
TableQuery<EmployeeEntity> employeeQuery =
    new TableQuery<EmployeeEntity>().Where(filter);

// Fetch the results one segment at a time; the continuation token marks
// where the next segment should start.
TableContinuationToken continuationToken = null;
do
{
    var employees = employeeTable.ExecuteQuerySegmented(
        employeeQuery, continuationToken);
    foreach (var emp in employees)
    {
        ...
    }
    continuationToken = employees.ContinuationToken;
} while (continuationToken != null);
Besides, I suggest paying attention to the table partition scalability targets: the target throughput for a single table partition (1 KB entities) is up to 2,000 entities per second.
If you reach the scalability target for the partition, the storage service will throttle your requests.
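If you do hit throttling, the classic SDK lets you attach a retry policy through TableRequestOptions. A minimal sketch, assuming the Microsoft.WindowsAzure.Storage client and the employeeTable/employeeQuery references from the snippets above:
using System;
using Microsoft.WindowsAzure.Storage.RetryPolicies;
using Microsoft.WindowsAzure.Storage.Table;

// Back off exponentially (up to 5 attempts) if the partition starts throttling.
var requestOptions = new TableRequestOptions
{
    RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 5)
};

TableContinuationToken token = null;
do
{
    var segment = employeeTable.ExecuteQuerySegmented(
        employeeQuery, token, requestOptions, null);
    // ... process segment ...
    token = segment.ContinuationToken;
} while (token != null);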
I have a table with ~10,000 PartitionKeys (PK), and each PK has ~500,000 RowKeys (RK) formatted as "yyyyMMddHHmmss".
Short version:
When I try to retrieve records with "yyyyMMdd" formatted RK without PK, it takes forever (literally) to get results.
Long version:
My queries are mostly PK + RK; unfortunately, some queries must be made by RK only.
I understand that retrieving data without the PK is not the best approach, but I have to.
And it looks like this is not an option at all in a real-life scenario.
The only way I can think of is keeping another table that maps RKs to PKs, but I really don't want to keep a reference table unless it is absolutely the only way to handle this.
Code:
CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
CloudTableClient tableClient = account.CreateCloudTableClient();
CloudTable table = tableClient.GetTableReference("test");
table.CreateIfNotExists();

// No PartitionKey in the filter, so this query has to scan across every partition.
var query = new TableQuery<TestEntity>().Where("(RowKey ge '20050103') and (RowKey lt '20050104')");
var result = table.ExecuteQuery(query);
Debug.WriteLine(result.Count());
As you have correctly noted, for such a volume of data, retrieval without a PK is not feasible.
If retrieval by date-only is necessary, as painful as it sounds, I would say that the design of your table schema is flawed and needs to be redesigned. If you wish to explain all of the usage scenarios for the data in the table, perhaps folks here can help with the proper design of the table?
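For what it's worth, the reference-table idea mentioned in the question is essentially the inter-partition secondary index pattern from the Table Design Guide: maintain a second table whose PartitionKey is the date portion of the RowKey, so date-only queries stay inside a single partition. A minimal sketch, assuming the classic SDK; DateIndexEntity and its property names are illustrative:
// Index entity: PartitionKey = "yyyyMMdd", RowKey = original PK + original RK,
// so a date-only lookup becomes a single-partition query on the index table.
public class DateIndexEntity : TableEntity
{
    public DateIndexEntity() { }

    public DateIndexEntity(string date, string originalPk, string originalRk)
        : base(date, originalPk + "_" + originalRk)
    {
        OriginalPartitionKey = originalPk;
        OriginalRowKey = originalRk;
    }

    public string OriginalPartitionKey { get; set; }
    public string OriginalRowKey { get; set; }
}

// A date-only query against the index table touches one partition instead of all ~10,000.
var indexQuery = new TableQuery<DateIndexEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "20050103"));
The trade-off is that every write to the main table now also has to write the corresponding index row.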
Is there an equivalent to TransactionScope that you can use with Azure Table Storage?
What I'm trying to do is the following:
using (TransactionScope scope = new TransactionScope())
{
    account.balance -= 10;
    purchaseOrders.Add(order);
    accountDataSource.SaveChanges();
    purchaseOrdersDataSource.SaveChanges();
    scope.Complete();
}
If for some reason saving the account works, but saving the purchase order fails, I don't want the account to decrement the balance.
Within a single table and single partition, you may write multiple rows in an entity group transaction. There's no built-in transaction mechanism when crossing partitions or tables.
That said: remember that tables are schema-less, so if you really needed a transaction, you could store both your account row and your purchase order row in the same table, same partition, and do a single (transactional) save.
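A minimal sketch of that single-partition entity group transaction, assuming the classic Microsoft.WindowsAzure.Storage.Table SDK; AccountEntity, OrderEntity, and the key values are illustrative:
// Hypothetical entities; both must share the same PartitionKey to go into one batch.
public class AccountEntity : TableEntity
{
    public AccountEntity() { }
    public AccountEntity(string pk, string rk) : base(pk, rk) { }
    public double Balance { get; set; }
}

public class OrderEntity : TableEntity
{
    public OrderEntity() { }
    public OrderEntity(string pk, string rk) : base(pk, rk) { }
    public double Amount { get; set; }
}

// Both rows are committed atomically in one entity group transaction:
// either both operations are applied or neither is.
var batch = new TableBatchOperation();
batch.InsertOrReplace(new AccountEntity("customer-123", "account") { Balance = 90 });
batch.Insert(new OrderEntity("customer-123", "order-0001") { Amount = 10 });
table.ExecuteBatch(batch); // 'table' is a CloudTable pointing at the shared table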
I have a Hibernate application that may produce concurrent inserts and updates (via Session.saveOrUpdate) to records with the same primary key, which is assigned. These transactions are somewhat long-running, perhaps 15 seconds on average (since data is collected from remote sources and persisted as it comes in). My DB isolation level is set to Read Committed, and I'm using MySQL and InnoDB.
The problem is this scenario creates excessive lock waits which timeout, either as a result of a deadlock or the long transactions. This leads me to a few questions:
Does the database engine only release its locks when the transaction is committed?
If this is the case, should I seek to shorten my transactions?
If so, would it be a good practice to use separate read and write transactions, where the write transaction could be made short and only take place after all of my data is gathered (the bulk of my transaction length involves collecting remote data)?
Edit:
Here's a simple test that approximates what I believe is happening. Since I'm dealing with long-running transactions, the commit takes place long after the first flush, so just to illustrate my situation I left the commit out of the test:
@Entity
static class Person {
    @Id
    Long id = Long.valueOf(1);
    @Version
    private int version;
}

@Test
public void updateTest() {
    for (int i = 0; i < 5; i++) {
        new Thread() {
            public void run() {
                Session s = sf.openSession();
                Transaction t = s.beginTransaction();
                Person p = new Person();
                s.saveOrUpdate(p);
                s.flush(); // Waits...
            }
        }.run();
    }
}
And the queries that this predictably produces, with the second insert waiting:
select id, version from person where id=?
insert into person (version, id) values (?, ?)
select id, version from person where id=?
insert into person (version, id) values (?, ?)
That's correct: the database releases locks only when the transaction is committed. Since you're using Hibernate, you can use optimistic locking, which does not lock the database for long periods of time. Essentially, Hibernate does what you suggest, separating the reading and writing portions into separate transactions. On write, it checks that the data in memory has not been changed concurrently in the database.
Hibernate Reference - Optimistic Transactions
Optimistic locking:
Base assumption: update conflicts occur seldom.
Mechanics:
1. Read the dataset, including its version field.
2. Change the dataset.
3. Update the dataset:
3.1. Read the dataset again, selecting by its key and the version field value you read in step 1.
If you get it, nobody has changed the record: apply the next version field value and update the record (see the sketch below).
If you do not get it, the record has been changed; return an appropriate message to the caller and you are done.
Inserts are not affected: you either have a separate primary key anyway, or you accept multiple records with identical values.
Therefore the example given above is not a case for optimistic locking.
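Language aside, step 3 above usually collapses into a single conditional UPDATE that matches on both the key and the previously read version, so a concurrent change shows up as zero affected rows. A minimal sketch (in C#, purely illustrative; the person table, its columns, and the use of System.Data.SqlClient are assumptions, not tied to the asker's MySQL/Hibernate setup):
using System.Data.SqlClient;

public static class OptimisticUpdate
{
    // Returns false when the row was changed concurrently (its version no longer matches).
    public static bool TryUpdatePersonName(
        string connectionString, long id, int expectedVersion, string newName)
    {
        const string sql =
            "UPDATE person SET name = @name, version = version + 1 " +
            "WHERE id = @id AND version = @expectedVersion";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@name", newName);
            command.Parameters.AddWithValue("@id", id);
            command.Parameters.AddWithValue("@expectedVersion", expectedVersion);

            connection.Open();
            return command.ExecuteNonQuery() == 1; // 0 rows => conflict, report it to the caller
        }
    }
}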