New item inserted in Azure Table Storage is not immediately available

I have an endpoint in an Azure Function called "INSERT" that inserts a record into Table Storage using a batch operation, and an endpoint in a different Azure Function called "GET" that reads a record from Table Storage.
If I insert an item and then immediately get that same item, then the item has not appeared yet!
If I delay by one second after saving, then I find the item.
If I delay by 10ms after saving, then I don't find the item.
I see the same symptom when updating an item. I set a date field on the item. If I get the item immediately after updating it, then sometimes the date field is not set yet.
Is this known behavior in Azure Table Storage? I know about ETags as described here, but I cannot see how they apply to this issue.
I cannot easily provide a code sample because the logic is distributed among multiple functions, and I suspect that if I reduced it to a simpler example, some mechanism would notice that I am calling from the same IP or with the same client and manage to return the recently saved item.

As mentioned in the comments, Azure Table Storage is Strongly Consistent. Data is available to you as soon as it is written to Storage.
This is in contrast with the Cosmos DB Table API, where there are multiple consistency levels and, depending on the consistency level set, data may not be immediately available to read after it is written.
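For illustration, a minimal read-after-write sketch using the azure-data-tables Python package (connection string, table name and key values are placeholders):

from azure.data.tables import TableClient

table = TableClient.from_connection_string(conn_str, table_name="items")
table.create_entity({"PartitionKey": "pk1", "RowKey": "rk1", "Content": "hello"})

# an immediate read returns the entity: Table Storage offers read-your-writes consistency
entity = table.get_entity("pk1", "rk1")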

The issue was related to my code and queues running in the background.
I had shut down the Function that has the queue triggers, but to my surprise I found that the Function in my staging slot was picking items off the queue. That is why it made a difference whether I delayed for a second or two.
As for the second part, why a date field seemingly was not set by the time I read the item back: it turns out I had filtered by columns, like this:
var operation = TableOperation.Retrieve<Entity>(partitionKey, id, new List<string> { "Content", "IsDeleted" });
And to make matters worse, the class "Entity" that I deserialize to of course has default values for primitive properties (such as false), so it was not obvious that those properties were simply not being populated.
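For anyone hitting the same pitfall, a rough Python sketch of such a projected read (names are placeholders); properties that are not part of the projection simply never appear in the result, so a deserialized class falls back to its defaults:

from azure.data.tables import TableClient

table = TableClient.from_connection_string(conn_str, table_name="items")

# only Content and IsDeleted come back; the date field is absent, not "unset"
rows = table.query_entities(
    "PartitionKey eq @pk and RowKey eq @rk",
    select=["Content", "IsDeleted"],
    parameters={"pk": partition_key, "rk": row_key},
)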
So the answer does not have much to do with the question. In summary, for anyone finding this question because they are wondering the same thing:
The answer is YES - Table Storage is in fact strongly consistent and it doesn't matter whether you're 'very fast' or connect from another location.

Related

Prevent DELETES from bypassing versioning in Amazon QLDB

Amazon QLDB allows querying the version history of a specific object by its ID. However, it also allows deleting objects. It seems like this can be used to bypass versioning by deleting and creating a new object instead of updating the object.
For example, let's say we need to track vehicle registrations by VIN.
INSERT INTO VehicleRegistration
<< {
'VIN' : '1N4AL11D75C109151',
'LicensePlateNumber' : 'LEWISR261LL'
} >>
Then our application can get a history of all LicensePlateNumber assignments for a VIN by querying:
SELECT * FROM _ql_committed_VehicleRegistration AS r
WHERE r.data.VIN = '1N4AL11D75C109151';
This will return all non-deleted document revisions, giving us an unforgeable history. The history function can be used similarly if you remember the document ID from the insert. However, if I wanted to maliciously bypass the history, I would simply delete the object and reinsert it:
DELETE FROM VehicleRegistration AS r WHERE r.VIN = '1N4AL11D75C109151';
INSERT INTO VehicleRegistration
<< {
'VIN' : '1N4AL11D75C109151',
'LicensePlateNumber' : 'ABC123'
} >>
Now there is no record that I have modified this vehicle registration, defeating the whole purpose of QLDB. The document ID of the new record will be different from the old, but QLDB won't be able to tell us that it has changed. We could use a separate system to track document IDs, but now that other system would be the authoritative one instead of QLDB. We're supposed to use QLDB to build these types of authoritative records, but the other system would have the exact same problem!
How can QLDB be used to reliably detect modifications to data?
There would be a record of the original document and its deletion in the ledger, available through the history() function, as you pointed out. So the bad behavior cannot actually be hidden; it is only a matter of hoping nobody knows to look for it. Again, as you pointed out.
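For instance, a hedged sketch with the pyqldb driver (ledger name and document_id are placeholders) that pulls every revision of a document, deletions included, from the history() function:

from pyqldb.driver.qldb_driver import QldbDriver

driver = QldbDriver(ledger_name="vehicles")

# history() returns all committed revisions; a deleted revision has metadata but no data
revisions = driver.execute_lambda(lambda txn: list(txn.execute_statement(
    "SELECT * FROM history(VehicleRegistration) AS h WHERE h.metadata.id = ?",
    document_id)))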
You have a couple of options here. First, QLDB rolled out fine-grained access control last week (announcement here). This would let you, say, prohibit deletes on a given table. See the documentation.
Another thing you can do is look for deletions or other suspicious activity in real-time using streaming. You can associate your ledger with a Kinesis Data Stream. QLDB will push every committed transaction into the stream where you can react to it using a Lambda function.
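As a rough sketch of that detection path, here is what a Python Lambda consuming the stream might look like. It assumes the amazon.ion package, aggregation disabled on the Kinesis configuration, and the documented stream record layout; treat the field names as assumptions to verify against the docs:

import base64
import amazon.ion.simpleion as ion

def handler(event, context):
    for record in event["Records"]:
        # each Kinesis record carries an Ion-encoded QLDB stream record
        payload = ion.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload["recordType"] != "REVISION_DETAILS":
            continue
        revision = payload["payload"]["revision"]
        # a committed revision without a "data" section is a deletion
        if "data" not in revision:
            doc_id = revision["metadata"]["id"]
            print(f"Deletion detected for document {doc_id}")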
If you don't need real-time detection, you can do something with QLDB's export feature. This feature dumps ledger blocks into S3, where you can extract and process data. The blocks contain not just your revision data but also the PartiQL statements used to create the transaction. You can set up an EventBridge scheduler to kick off a periodic export (say, of the day's transactions) and then churn through it to look for suspicious deletes, etc. This lab might be helpful for that.
I think the best approach is to manage it with permissions. Keep developers out of production or make them assume a temporary role to get limited access.

Azure Table Storage from Python function consistently failing when looking up a missing entity

I have the following setup:
An Azure Function in Python 3.6 processes some entities and uses the TableService class from the Python Table API (the new one for Cosmos DB) to check whether the entity currently being processed is already in a storage table. This is done by invoking TableService.get_entity.
The get_entity method throws an exception each time it does not find an entity with the given row id (the same as the entity id) and partition key.
If no entity with this id is found, I call insert_entity to add it to the table.
With this I am trying to process only entities that haven't been processed before (i.e. are not yet logged in the table).
However, I consistently observe the function freezing after exactly 10 exceptions: it halts execution on the 10th invocation and does not continue processing for another minute or two!
I even changed the implementation so that, instead of doing a lookup first, it simply calls insert_entity and lets it fail when a duplicate row key is added. Surprisingly enough, the behaviour is exactly the same: on the 10th duplicate invocation the execution freezes and continues after a while.
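For reference, a minimal sketch of that second approach (insert and catch the conflict) with the azure-cosmosdb-table package; the table name, connection string and entity id are placeholders:

from azure.common import AzureConflictHttpError
from azure.cosmosdb.table.tableservice import TableService

table_service = TableService(connection_string=conn_str)

entity = {"PartitionKey": "processed", "RowKey": entity_id}
try:
    # succeeds only for entities that have not been logged before
    table_service.insert_entity("processedentities", entity)
    already_processed = False
except AzureConflictHttpError:
    already_processed = True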
What is the cause of this behaviour? Is this some sort of protection mechanism on the storage account that kicks in? Or is it something to do with the Python client? To me it looks very much like something by design.
I wasn't able to find any documentation or settings page in the portal for influencing such behaviour.
I am wondering if it is possible to implement such logic using Table Storage at all. I don't find it justifiable to spin up an Azure SQL database or Cosmos DB instance for the trivial functionality of checking whether an entity already exists in a table.
Thanks

Sequential numbering in the cloud

Ok so a simple task such as generating a sequential number has caused us an issue in the cloud.
Where you have more than one server, it gets harder and harder to guarantee that the numbers allocated on different servers do not clash.
We are using Azure servers if it helps.
We thought about using the app cache but you cannot guarantee it will be updated between servers.
We are limited to using:
a SQL table with an identity column
or
some peer-to-peer method between servers
or
a blob store, utilising its locks to store the most up-to-date number (this could have scaling issues).
I just wondered if anyone has an idea of a solution to resolve this?
Surely it's a simple problem and must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but are guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this since they needed unique int PKs to synchronize back on-premises:
Create a queue and a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to empty).
Now, you can have your code first poll the next number from the queue, delete it from the queue and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequential numbers. In scenarios with frequent inserts, you may be saving things out of order to the database (two processes poll from queue, one polls first but saves last, the PK's saved to the database are not sequential anymore)
The biggest downside is that you now have to maintain/monitor a process that replenishes the pool of PK's.
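A minimal sketch of the pattern above with an Azure Storage queue in Python (queue name, connection string and the replenisher's bookkeeping are placeholders):

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(conn_str, "sequence-numbers")

# replenisher: top up the pool when it runs low (last_issued is persisted elsewhere)
if queue.get_queue_properties().approximate_message_count < 100:
    for n in range(last_issued + 1, last_issued + 1001):
        queue.send_message(str(n))

# consumer: take the next number, delete it from the queue, then use it as the PK
message = next(queue.receive_messages())
next_id = int(message.content)
queue.delete_message(message)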
After reading this, I would not rely on an identity column.
I think the best way is, before inserting, to get the last stored id and increment it by one (programmatically). Another option is to create a trigger, but it could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
declare @seq int
set @seq = (select max(id) + 1 from table_name)
update table_name
set table_name.id = @seq
from table_name
inner join inserted
on table_name.id = inserted.id
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, then you can use the SnowMaker library which is available on GitHub and Nuget. It gets around the scale problem by retrieving blocks of ids into a local cache. This guarantees that the Ids are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.
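For completeness, a hedged sketch of the blob-lease idea SnowMaker is built around, in Python with azure-storage-blob (container, blob name and block size are placeholders):

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(conn_str, "counters", "order-ids")

# the lease prevents other servers from updating the counter concurrently
lease = blob.acquire_lease(lease_duration=15)
try:
    current = int(blob.download_blob().readall())
    block_start, block_end = current + 1, current + 100   # reserve a block of 100 ids
    blob.upload_blob(str(block_end), overwrite=True, lease=lease)
finally:
    lease.release()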

How to update fields automatically

In my CouchDB database I'd like all documents to have an 'updated_at' timestamp added when they're changed (and have this enforced).
I can't modify the document with validation functions
update functions won't run unless they're called specifically (so it'd be possible to update the document and not call the specific update function)
How should I go about implementing this?
There is no way to do this right now without explicitly calling _update handlers. Tracking a document's modification time is a nice idea, but it runs into problems with replication.
Replication works on top of the public API, and this means that:
If you enforce such a trigger, replication breaks, since it becomes impossible to sync data as-is without modifying the document. Because the document gets modified, it receives a new revision, which can easily lead to an endless loop if you replicate data from database A to B and B to A in continuous mode.
In the other case, when replication is kept working, there will always be a way to work around your trigger.
I can suggest one workaround: you can create a view which emits the current date as a key (or as part of it):
function (doc) {
  emit(new Date(), null);
}
This will assign the current date to all documents as soon as the view generation gets triggered (which happens after the first request to it) and will reassign a new date on each update of a specific document.
Although the above should solve your issue, I would advise against using it for the reasons already explained by Kxepal: if you're on a replicated network, each node will assign its own dates. So taking this into account, the best I can recommend is to solve the issue on the client side and just post the documents with a date already embedded.
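A minimal client-side sketch in Python (CouchDB URL and document id are placeholders); the only guarantee here is convention, not enforcement:

import datetime
import requests

url = "http://localhost:5984/mydb/some-doc-id"
doc = {"name": "example", "updated_at": datetime.datetime.utcnow().isoformat() + "Z"}

# include the current revision when the document already exists
resp = requests.get(url)
if resp.status_code == 200:
    doc["_rev"] = resp.json()["_rev"]
requests.put(url, json=doc).raise_for_status()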

Is it possible to make conditional inserts with Azure Table Storage

Is it possible to make a conditional insert with the Windows Azure Table Storage Service?
Basically, what I'd like to do is to insert a new row/entity into a partition of the Table Storage Service if and only if nothing changed in that partition since I last looked.
In case you are wondering, I have Event Sourcing in mind, but I think that the question is more general than that.
Basically I'd like to read part of, or an entire, partition and make a decision based on the content of the data. In order to ensure that nothing changed in the partition since the data was loaded, an insert should behave like normal optimistic concurrency: the insert should only succeed if nothing changed in the partition - no rows were added, updated or deleted.
Normally in a REST service, I'd expect to use ETags to control concurrency, but as far as I can tell, there's no ETag for a partition.
The best solution I can come up with is to maintain a single row/entity for each partition in the table which contains a timestamp/ETag and then make all inserts part of a batch consisting of the insert as well as a conditional update of this 'timestamp entity'. However, this sounds a little cumbersome and brittle.
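For concreteness, here is roughly what that batch could look like with the azure-data-tables Python package (table name, partition, marker convention and timestamp are placeholders); the whole transaction fails if the marker entity changed since it was read:

from azure.core import MatchConditions
from azure.data.tables import TableClient, UpdateMode

table = TableClient.from_connection_string(conn_str, table_name="events")

# read the per-partition marker entity and remember its ETag
marker = table.get_entity("stream-123", "marker")
marker["LastWrite"] = "2024-01-01T00:00:00Z"

new_event = {"PartitionKey": "stream-123", "RowKey": "event-042", "Payload": "..."}

# both operations share the partition key, so they commit (or fail) atomically
table.submit_transaction([
    ("create", new_event),
    ("update", marker, {"mode": UpdateMode.REPLACE,
                        "etag": marker.metadata["etag"],
                        "match_condition": MatchConditions.IfNotModified}),
])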
Is this possible with the Azure Table Storage Service?
The view from a thousand feet
Might I share a small tale with you...
Once upon a time someone wanted to persist events for an aggregate (of Domain-Driven Design fame) in response to a given command. This person wanted to ensure that an aggregate would only be created once and that any optimistic concurrency conflict could be detected.
To tackle the first problem - that an aggregate should only be created once - he did an insert into a transactional medium that threw when a duplicate aggregate (or more accurately the primary key thereof) was detected. The thing he inserted was the aggregate identifier as primary key and a unique identifier for a changeset. A collection of events produced by the aggregate while processing the command is what is meant by changeset here. If someone or something else beat him to it, he would consider the aggregate already created and leave it at that. The changeset would be stored beforehand in a medium of his choice. The only promise this medium must make is to return what has been stored as-is when asked. Any failure to store the changeset would be considered a failure of the whole operation.
To tackle the second problem - detection of optimistic concurrency in the further life-cycle of the aggregate - he would, after having written yet another changeset, update the aggregate record in the transactional medium if and only if nobody had updated it behind his back (i.e. compared to what he last read just before executing the command). The transactional medium would notify him if such a thing happened. This would cause him to restart the whole operation, rereading the aggregate (or changesets thereof) to make the command succeed this time.
Of course, now that he had solved the writing problems, along came the reading problems. How would one be able to read all the changesets of an aggregate that made up its history? After all, he only had the last committed changeset associated with the aggregate identifier in that transactional medium. And so he decided to embed some metadata as part of each changeset. Among the metadata - which is not so uncommon to have as part of a changeset - would be the identifier of the previous last committed changeset. This way he could "walk the line" of changesets of his aggregate, like a linked list so to speak.
As an additional perk, he would also store the command message identifier as part of the metadata of a changeset. This way, when reading changesets, he could know in advance if the command he was about to execute on the aggregate was already part of its history.
All's well that ends well ...
P.S.
1. The transactional medium and changeset storage medium can be the same,
2. The changeset identifier MUST not be the command identifier,
3. Feel free to punch holes in the tale :-),
4. Although not directly related to Azure Table Storage, I've implemented the above tale successfully using AWS DynamoDB and AWS S3.
How about storing each event at a "PartitionKey/RowKey" created based on AggregateId/AggregateVersion, where AggregateVersion is a sequential number based on how many events the aggregate already has?
This is very deterministic, so when adding a new event to the aggregate, you make sure that you were using the latest version of it, because otherwise you'll get an error saying that the row already exists in that partition. At that point you can drop the current operation and retry, or try to figure out whether you could merge the operations anyway, if the new updates to the aggregate do not conflict with the operation you just did.
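A minimal sketch of that append with the azure-data-tables Python package (aggregate_id, expected_version and serialized_event are placeholders):

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

table = TableClient.from_connection_string(conn_str, table_name="events")

event = {
    "PartitionKey": aggregate_id,          # one partition per aggregate
    "RowKey": f"{expected_version:010d}",  # zero-padded so RowKeys sort numerically
    "Payload": serialized_event,
}
try:
    table.create_entity(event)
except ResourceExistsError:
    # another writer appended this version first: reload the aggregate and retry
    pass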
