Azure Table Storage batch insert with potentially pre-existing rowkeys

I'm trying to send a simple batch of Insert operations to Azure Table Storage, but it seems the whole batch transaction is invalidated and, using the managed Azure Storage client (2.0), the ExecuteBatch method itself throws an exception if a single Insert in the batch targets a pre-existing record:
public class SampleEntity : TableEntity
{
    public SampleEntity(string partKey, string rowKey)
    {
        this.PartitionKey = partKey;
        this.RowKey = rowKey;
    }
}
var acct = CloudStorageAccount.DevelopmentStorageAccount;
var client = acct.CreateCloudTableClient();
var table = client.GetTableReference("SampleEntities");
var foo = new SampleEntity("partition1", "preexistingKey");
var bar = new SampleEntity("partition1", "newKey");
var batchOp = new TableBatchOperation();
batchOp.Add(TableOperation.Insert(foo));
batchOp.Add(TableOperation.Insert(bar));
var result = table.ExecuteBatch(batchOp); // throws exception: "0:The specified entity already exists."
The batch-level exception is avoided by using InsertOrMerge, but then every individual operation response returns a 204, regardless of whether that particular operation resulted in an insert or a merge. So it seems it's impossible for the client application to know whether it, or another node in the cluster, inserted the record. Unfortunately, in my current case, this knowledge is necessary for some downstream synchronization.
Is there some configuration or technique to allow the batch of inserts to proceed and return the particular response code per-item without throwing a blanket exception?

As you already know, since a batch is a transactional operation you get an all-or-none kind of deal. One interesting thing about batch transactions is that you get the index of the first failed entity in the batch. So assuming you're trying to insert 100 entities in a batch and the 50th entity is already present in the table, the batch operation will give you the index of the failed entity (49 in this case).
Is there some configuration or technique to allow the batch of inserts to proceed and return the particular response code per-item without throwing a blanket exception?
I don't think so. The transaction would fail as soon as the first entity fails. It will not even attempt to process other entities.
Possible Solutions (Just thinking out loud :))
If I understand correctly, your key requirement is to identify whether an entity was inserted or merged (or replaced). For that, you would need to separate the failed entities out of the batch and process them separately. Based on this, I can think of two approaches:
1. Split that batch into 3 batches: the 1st batch will contain 49 entities, the 2nd batch will contain just the 1 entity which failed, and the 3rd batch will contain the remaining 50 entities. You could now insert all entities in the 1st batch, decide what you want to do with the failed entity, and try to insert the 3rd batch. You would need to repeat the process over and over again until the operation is complete.
2. Remove the failed entity from the batch and retry that batch. So in the example above, on your 1st attempt you'll try with 100 entities, on your 2nd attempt with 99 entities, and so on, keeping track of the failed entities (and the reason they failed) all the while. Once the batch operation completes successfully, you can work with all the failed entities (see the sketch below).
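A rough sketch of the second approach (not the answer author's code), using the classic Microsoft.WindowsAzure.Storage client: retry the batch, dropping the operation that failed each time, and hand the conflicting operations back to the caller. The helper name is made up; the "0:..." error-message format is taken from the exception shown in the question.
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

static IList<TableOperation> ExecuteBatchSkippingConflicts(
    CloudTable table, IEnumerable<TableOperation> operations)
{
    var remaining = new List<TableOperation>(operations);
    var failed = new List<TableOperation>();

    while (remaining.Count > 0)
    {
        var batch = new TableBatchOperation();
        foreach (var op in remaining)
            batch.Add(op);

        try
        {
            table.ExecuteBatch(batch);
            break; // the whole remaining batch went through
        }
        catch (StorageException ex)
        {
            // The error message looks like "0:The specified entity already exists.",
            // where the number before the ':' is the index of the failed operation.
            string message = ex.RequestInformation.ExtendedErrorInformation?.ErrorMessage;
            int colon = message?.IndexOf(':') ?? -1;
            if (colon <= 0 || !int.TryParse(message.Substring(0, colon), out int failedIndex))
                throw; // some other failure; don't try to be clever here

            failed.Add(remaining[failedIndex]);
            remaining.RemoveAt(failedIndex);
        }
    }

    return failed; // the caller decides what to do with the pre-existing entities
}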

Related

Is it possible to roll back Azure storage table transactions so that I do not lose data?

I am fairly new to Azure cloud development.
I have a function app coded in C# that:
1. Gets a record from a storage table
2. Deletes that record
3. Updates fields on that record (including the partition key)
4. Inserts the new record into the storage table
I am experiencing data loss when an exception is thrown on the insert (step 4).
I am wondering how, if step 4 throws an exception, I can then roll back the delete in step 2. If that is not possible, how would I prevent the data loss, given that I'm unable to use the built-in Table Operations that would replace the entity, because I am changing the partition key?
I understand the hard part in all of this to be the partition key update, as I know the system was designed so that each transaction or operation works on records with the same partition key.
I have looked through the Table Service REST API and looked at all the Table Operations I thought could be helpful:
Insert Entity
Update Entity
Merge Entity
Insert or Update Entity
Insert or Replace Entity
You can't use a transaction here because the partition key changes (entity group transactions only operate within a single partition). So you'll have to look at a solution outside of table storage.
What you could do is create the new record before deleting the old one. That way you're assured you won't lose any data (as long as you make sure the request to create the record succeeded).
You could take it one step further by making it an async process: have a storage queue or service bus queue up a message containing the information of the request, and have a function app (or anything else) handle the requests. That way you can ensure the request remains retryable in case any transient errors occur over a larger timespan.
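A minimal sketch of the queue-based variant (my own, not the answer author's), using the classic Microsoft.WindowsAzure.Storage client; the queue name and the message shape are made up for illustration.
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;
using Newtonsoft.Json;

var account = CloudStorageAccount.DevelopmentStorageAccount;
var queueClient = account.CreateCloudQueueClient();
var queue = queueClient.GetQueueReference("partition-key-changes"); // hypothetical queue name
queue.CreateIfNotExists();

// Enqueue the "move this entity to a new partition" request. A queue-triggered function
// can then do the insert first and the delete second; the message stays retryable if
// transient errors occur.
var request = new { PartitionKey = "USA", RowKey = "1", NewPartitionKey = "India" };
queue.AddMessage(new CloudQueueMessage(JsonConvert.SerializeObject(request)));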
We were able to reproduce the data loss issue described in the question: starting from an existing record in the table, once an exception occurred on the insert, the data was lost.
Azure Table storage (and the Cosmos DB Table API) doesn't allow updating the value of PartitionKey directly; you first need to delete the record and then create a new record with the new PartitionKey value.
To prevent data loss using the built-in TableOperations, only Execute() the delete once the prior steps have completed successfully.
// Prepare the delete for the retrieved entity, but don't execute it yet.
TableOperation delOperation = TableOperation.Delete(getCustomer);
You can clone or create a deep copy of the retrieved object using a copy constructor:
public Customer(Customer customer)
{
    PartitionKey = customer.PartitionKey;
    RowKey = customer.RowKey;
    customerName = customer.customerName;
}
Create a copy of the object with the new PartitionKey value:
Customer c = new(getCustomer)
{
    PartitionKey = "India"
};
Only once the insert (step 4 in the question) has completed successfully do we commit the delete operation. In our test we again got an exception on the insert step, but when we looked at the table no data had been lost. Below is the code snippet that prevents the data loss:
TableOperation _insOperation = TableOperation.Insert(c);
var insResult = _table.Execute(_insOperation);
if (insResult.HttpStatusCode == 204)
{
    // The insert succeeded, so it is now safe to delete the original record.
    var delResult = _table.Execute(delOperation);
}

Service Fabric - (reaching MaxReplicationMessageSize) Huge amount of data in a reliable dictionary

EDIT question summary:
I want to expose an endpoint that will be capable of returning portions of XML data based on some query parameters.
I have a stateful service (that keeps the XML data, converted to DTOs, in a reliable dictionary).
I use a single, named partition (I just can't tell which partition holds the data from the query parameters passed, so I can't implement a smarter partitioning strategy).
I am using service remoting for communication between the stateless Web API service and the stateful one.
The XML data may reach 500 MB.
Everything is OK when the XML is only around 50 MB.
When the data gets larger, Service Fabric complains about MaxReplicationMessageSize.
And the summary of my few questions from below: how can one store a large amount of data in a reliable dictionary?
TL;DR:
Apparently, I am missing something...
I want to parse huge XMLs and load them into a reliable dictionary for later queries over them.
I am using a single, named partition.
I have an XMLData stateful service that loads these XMLs into a reliable dictionary in its RunAsync method via this piece of code:
var myDictionary = await this.StateManager.GetOrAddAsync<IReliableDictionary<string, List<HospitalData>>>("DATA");
using (var tx = this.StateManager.CreateTransaction())
{
    var result = await myDictionary.TryGetValueAsync(tx, "data");
    ServiceEventSource.Current.ServiceMessage(this, "data status: {0}",
        result.HasValue ? "loaded" : "not loaded yet, starts loading");
    if (!result.HasValue)
    {
        Stopwatch timer = new Stopwatch();
        timer.Start();
        var converter = new DataConverter(XmlFolder);
        List<HospitalData> data = converter.LoadData();
        await myDictionary.AddOrUpdateAsync(tx, "data", data, (key, value) => data);
        timer.Stop();
        ServiceEventSource.Current.ServiceMessage(this,
            string.Format("Loading of data finished in {0} ms", timer.ElapsedMilliseconds));
    }
    await tx.CommitAsync();
}
I have a stateless Web API service that communicates with the above stateful one via service remoting and queries the dictionary via this code:
ServiceUriBuilder builder = new ServiceUriBuilder(DataServiceName);
IDataService DataServiceClient = ServiceProxy.Create<IDataService>(builder.ToUri(),
    new Microsoft.ServiceFabric.Services.Client.ServicePartitionKey("My.single.named.partition"));
try
{
    var data = await DataServiceClient.QueryData(SomeQuery);
    return Ok(data);
}
catch (Exception ex)
{
    ServiceEventSource.Current.Message("Web Service: Exception: {0}", ex);
    throw;
}
It works really well when the XMLs do not exceed 50 MB.
After that I get errors like:
System.Fabric.FabricReplicationOperationTooLargeException: The replication operation is larger than the configured limit - MaxReplicationMessageSize ---> System.Runtime.InteropServices.COMException
Questions:
I am almost certain that it is about the partitioning strategy and that I need to use more partitions. But how do I reference a particular partition while in the context of the RunAsync method of the stateful service? (The stateful service is invoked via RPC from the Web API, where I explicitly point at a partition, so there I can easily choose among partitions if using the ranged partitioning strategy. But how do I do that during the initial loading of data in the RunAsync method?)
Are these thoughts of mine correct: the code in a stateful service operates on a single partition, thus loading a huge amount of data and partitioning that data should happen outside the stateful service (for example in an Actor). Then, after determining the partition key, I just invoke the stateful service via RPC, pointing it to that particular partition.
Is it actually a partitioning problem at all, and what (where, who) defines the size of a replication message? I.e., does the partitioning strategy influence the replication message sizes?
Would extracting the loading logic into a stateful Actor help in any way?
For any help on this - thanks a lot!
The issue is that you're trying to add a large amount of data into a single dictionary record. When Service Fabric tries to replicate that data to other replicas of the service, it encounters a quota of the replicator, MaxReplicationMessageSize, which indeed defaults to 50MB (documented here).
You can increase the quota by specifying a ReliableStateManagerConfiguration:
internal sealed class Stateful1 : StatefulService
{
    public Stateful1(StatefulServiceContext context)
        : base(context, new ReliableStateManager(context,
            new ReliableStateManagerConfiguration(new ReliableStateManagerReplicatorSettings
            {
                MaxReplicationMessageSize = 1024 * 1024 * 200
            })))
    { }
}
But I strongly suggest you change the way you store your data. The current method won't scale very well and isn't the way Reliable Collections were meant to be used.
Instead, you should store each HospitalData in a separate dictionary item. Then you can query the items in the dictionary (see this answer for details on how to use LINQ). You will not need to change the above quota.
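A minimal sketch of that per-item layout (my own, under the assumption that HospitalData exposes a string Id usable as the key; the property and the "HOSPITALS" dictionary name are hypothetical), run from the same RunAsync as in the question:
var hospitals = await this.StateManager
    .GetOrAddAsync<IReliableDictionary<string, HospitalData>>("HOSPITALS");

var converter = new DataConverter(XmlFolder);
List<HospitalData> data = converter.LoadData();

// One entry per record, committed in its own transaction, so no single replication
// operation has to carry the whole multi-hundred-MB payload.
foreach (var hospital in data)
{
    using (var tx = this.StateManager.CreateTransaction())
    {
        await hospitals.AddOrUpdateAsync(tx, hospital.Id, hospital, (key, existing) => hospital);
        await tx.CommitAsync();
    }
}
You could also group a few hundred items per transaction to reduce commit overhead, as long as each transaction stays well under the replication message limit.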
PS - You don't necessarily have to use partitioning for 500MB of data. But regarding your question - you could use partitions even if you can't derive the key from the query, simply by querying all partitions and then combining the data.

What is the solution for multi-table ACID transactions in Cassandra?

I was following this link to use a batch transaction without using the BATCH keyword.
Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .build();
Session session = cluster.newSession();
// Save off the prepared statement you're going to use
PreparedStatement statement = session.prepare("INSERT INTO tester.users (userID, firstName, lastName) VALUES (?,?,?)");
List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
for (int i = 0; i < 1000; i++) {
    // please bind with whatever actually useful data you're importing
    BoundStatement bind = statement.bind(i, "John", "Tester");
    ResultSetFuture resultSetFuture = session.executeAsync(bind);
    futures.add(resultSetFuture);
}
// not returning anything useful, but makes sure everything has completed before you exit the thread
for (ResultSetFuture future : futures) {
    future.getUninterruptibly();
}
cluster.close();
My question is: with the given approach, is it possible to INSERT, UPDATE or DELETE data in different tables so that if any of the operations fails, all of them fail, while maintaining the same performance (as described in the link)?
With this approach, what I tried was inserting and deleting data across different tables; one query failed, but all the previous queries had already executed and updated the db.
With BATCH I can see that if any statement fails, all statements fail. But using BATCH across different tables is an anti-pattern, so what is the solution?
With BATCH I can see that if any statement fails, all statements fail.
Wrong; the guarantee of a LOGGED BATCH is: if some statements in the batch fail, they will be retried until they succeed.
But using BATCH across different tables is an anti-pattern, so what is the solution?
ACID transactions are not possible with Cassandra; they would require some sort of global lock or global coordination and would be prohibitively expensive performance-wise.
However, if you don't care about the performance cost, you can implement a global lock/lease system yourself using lightweight transaction primitives, as described here.
But be ready to face poor performance.
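A sketch of such a lock/lease, written against the DataStax C# driver for brevity (the thread's code uses the Java driver, but the CQL is the same); the tester.locks table and the lock name are hypothetical.
using System.Linq;
using Cassandra; // DataStax C# driver

var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
var session = cluster.Connect();

// Hypothetical lock table: CREATE TABLE tester.locks (name text PRIMARY KEY, owner text);
// IF NOT EXISTS turns the insert into a lightweight transaction (Paxos), so only one
// client acquires the lease. In practice you would also add a TTL so a crashed owner
// doesn't hold the lock forever.
var rs = session.Execute(
    "INSERT INTO tester.locks (name, owner) VALUES ('users-update', 'node-1') IF NOT EXISTS");
bool acquired = rs.First().GetValue<bool>("[applied]");

if (acquired)
{
    // ...perform the multi-table writes here...
    session.Execute("DELETE FROM tester.locks WHERE name = 'users-update'");
}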

Does TableBatchOperation guarantee that all entities will have the same Timestamp field value?

I'm adding multiple entries to Azure CloudTable:
TableBatchOperation tableBatchOperation = new TableBatchOperation();
foreach (var entity in entities)
{
    tableBatchOperation.InsertOrReplace(entity);
}
table.ExecuteBatch(tableBatchOperation);
Is there any guarantee that all entries inserted / updated in this batch operation will have the same Timestamp property value?
Short answer: entities inserted in the same batch can have different timestamps.
It depends on the batch size and, I guess, on the current load of the Table Service.
I wrote a simple unit test to check that (you can find it here), and in one batch of 100 items (each with a 30KB string property) I can see a few different timestamps (ticks):
635516539271235769
635516539271245771
635516539271225762
but for smaller batches the timestamp is sometimes the same.
The differences are really small (ticks), but I definitely would not depend on Timestamp, since it's an internal Azure Table Service property and it changes on every update.
I would rather add another property to the entity holding a batch timestamp.
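A minimal sketch of that suggestion, assuming your entity type exposes a custom BatchTimestamp property (the name is hypothetical):
// Stamp every entity with the same value before submitting the batch, so downstream
// consumers can group on it instead of relying on the service-managed Timestamp.
var batchTimestamp = DateTimeOffset.UtcNow;
var tableBatchOperation = new TableBatchOperation();
foreach (var entity in entities)
{
    entity.BatchTimestamp = batchTimestamp; // hypothetical custom property
    tableBatchOperation.InsertOrReplace(entity);
}
table.ExecuteBatch(tableBatchOperation);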

Table Storage Batch Table Results

Azure Table Storage offers a batch operation method. It returns a list of TableResults. From what I've seen, this return value never contains a mix of failures and successes (as you'd expect, since a batch is atomic). I haven't been able to find documentation that states this explicitly, though. If anyone has a handy link to this specific info, let me know.
A TableBatchOperation is atomic, so there is no point in continuing to execute the batch after the first failure. There are 2 outcomes for a TableBatchOperation: either all operations succeed and the overall request succeeds, or the request returns on the first failed operation and the changes made by previous operations are rolled back.
The interesting thing here is that you will get a StorageException if one of the operations in the batch fails, and the index of the failed operation is embedded inside the StorageException object. Then, if you want to, you can implement logic to automatically remove that operation from the batch (and log it) and resubmit the TableBatchOperation.
I have implemented a StorageException extension class which extracts the failed operation index and lots of other useful information from the StorageException object.
Feel free to use it:
https://www.nuget.org/packages/AzureStorageExceptionParser/
