In Azure Table Storage, does CreateIfNotExists count as a transaction?

We have working code and I'm thinking about transaction optimizations. Every time we insert something into a table, CreateIfNotExists() is called. Is it counted as a transaction? We have many tables, several for each customer, so that a customer's data can be deleted in one operation.
Would it be a better approach to insert the data and, if it fails with a "table does not exist" exception, create the table and insert the data again?

Every time we insert something into a table, CreateIfNotExists() is called. Is it counted as a transaction?
Yes. Essentially, CreateIfNotExists tries to create the table and catches the exception, checking it for a Conflict (409) status code. Since it issues a request against the Table service, you get charged for it as a transaction.
Would it be a better approach to insert the data and, if it fails with a "table does not exist" exception, create the table and insert the data again?
You can certainly do that. Another approach (not sure if it would work for you) is to check for these tables only at application startup (in fact, this is what we do in our application).
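For illustration, here is a minimal sketch of both approaches using the Python azure-data-tables SDK (the original code uses the .NET client, so treat everything here, including the package choice, table names, and connection string, as illustrative rather than the actual implementation):

# Create all known tables once at startup instead of on every insert.
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableServiceClient

CONN_STR = "<storage-connection-string>"              # placeholder
CUSTOMER_TABLES = ["customer1data", "customer1logs"]  # placeholder names

service = TableServiceClient.from_connection_string(CONN_STR)

# One billable call per table, paid once at startup; the SDK swallows the
# 409 Conflict if the table already exists.
for name in CUSTOMER_TABLES:
    service.create_table_if_not_exists(table_name=name)

def insert_with_lazy_create(table_name, entity):
    # The "insert first, create on failure" variant from the question,
    # assuming the SDK surfaces a missing table as ResourceNotFoundError (404).
    client = service.get_table_client(table_name)
    try:
        client.create_entity(entity)
    except ResourceNotFoundError:
        service.create_table_if_not_exists(table_name=table_name)
        client.create_entity(entity)  # retry once after creating the table

Either way, the table-creation transaction is paid at most once per table rather than on every insert.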

Related

Azure Table Storage from a Python function consistently failing when looking up a missing entity

I have the following setup:
An Azure Function in Python 3.6 processes some entities, using the TableService class from the Python Table API (the new one for Cosmos DB) to check whether the currently processed entity is already in a storage table. This is done by invoking TableService.get_entity. The get_entity method throws an exception each time it does not find an entity with the given row id (the same as the entity id) and partition key.
If no entity with this id is found, I call insert_entity to add it to the table.
With this I am trying to process only entities that haven't been processed before (i.e. not logged in the table).
However, I consistently observe the function freezing after exactly 10 exceptions: it halts execution on the 10th invocation and does not continue processing for another minute or two.
I even changed the implementation so that, instead of doing a lookup first, it simply calls insert_entity and lets it fail when a duplicate row key is added. Surprisingly, the behaviour is exactly the same: on the 10th duplicate invocation the execution freezes and continues after a while.
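The two variants described above boil down to roughly the following (an illustrative sketch against the azure-cosmosdb-table TableService, not the actual function code; the table name and keys are placeholders):

from azure.common import AzureConflictHttpError, AzureMissingResourceHttpError
from azure.cosmosdb.table.tableservice import TableService

table_service = TableService(account_name="<account>", account_key="<key>")
TABLE = "processedentities"  # placeholder

def already_processed(entity_id):
    # Variant 1: look up first; a miss raises AzureMissingResourceHttpError.
    try:
        table_service.get_entity(TABLE, "processed", entity_id)
        return True
    except AzureMissingResourceHttpError:
        return False

def mark_processed(entity_id):
    # Variant 2: insert blindly; a duplicate RowKey raises AzureConflictHttpError.
    try:
        table_service.insert_entity(
            TABLE, {"PartitionKey": "processed", "RowKey": entity_id})
        return True   # first time this entity has been seen
    except AzureConflictHttpError:
        return False  # already processed earlier

Both paths hit the same freeze after the 10th exception.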
What is the cause of such behaviour? Is this some sort of protection mechanism on the storage account that kicks in? Or is it something to do with the Python client? To me it looks very much like something by design.
I wasn't able to find any documentation or a settings page in the portal for influencing such behaviour.
I am also wondering whether it is possible to implement such logic using Table Storage at all? It doesn't seem justifiable to spin up an Azure SQL database or Cosmos DB instance for the trivial functionality of checking whether an entity already exists in a table.
Thanks

How to stop Power Query from querying the full original dataset

I have an Excel file connected to an Access database. I have created a query through Power Query that simply brings the target table into the file and does a couple of minor things to it. I don't load this to a worksheet but maintain a connection only.
I then have a number of other queries linking to the table created in the first query.
In one of these linked queries, I apply a variety of filters to exclude certain products, customers and so on. This reduces the 400,000 records in the original table in the first query down to around 227,000 records. I then load this table to a worksheet to do some analysis.
Finally I have a couple of queries looking at the 227,000 record table. However, I notice that when I refresh these queries and watch the progress in the right hand pane, they still go through 400,000 records as if they are looking through to the original table.
Is there any way to stop this happening in the expectation that doing so would help to speed up queries that refer to datasets that have themselves already been filtered?
Alternatively is there a better way to do what I’m doing?
Thanks
First: How are you refreshing your queries? If you execute them one at a time then yes, they're all independent. However, when using Excel 2016 on a workbook where "Fast Data Load" is disabled on all queries, I've found that a Refresh All does cache and share query results with downstream queries!
Failing that, you could try the following:
1. Move the query that makes the 227,000-row table into its own group called "Refresh First".
2. Place your cursor in your 227,000-row table and click Data - Get & Transform - From Table.
3. Change all of your queries to pull from this new query rather than the source.
4. Create another group called "Refresh Second" that contains every query that is downstream of the query you created in step 2 and loads data to the workbook.
5. Move any remaining queries that load to the workbook into "Refresh First", "Refresh Second", or some other group. (By the way: I usually also have a "Connections" group that holds every query that doesn't load data to the workbook.)
Unfortunately, once you do this, "Refresh All" would have to be done twice to ensure all source changes are fully propagated, because those 227,000 rows will be used before they've been updated from the 400,000. If you're willing to put up with this and refresh manually, then you're all set: you can right-click and refresh query groups. Just right-click and refresh the first group, wait, then right-click and refresh the second one.
For a more idiot-proof way of refreshing... you could try automating it with VBA, but queries normally refresh in the background; it will take some extra work to ensure that the second group of queries aren't started before all of the queries in your "Refresh First" group are completed.
Or... I've learned to strike a balance between fidelity in the real world but speed when developing by doing the following:
Create a query called "ProductionMode" that returns true if you want full data, or false if you're just testing each query. This can just be a parameter if you like.
Create a query called "fModeSensitiveQuery" defined as
let
    // Get this once per time this function is retrieved and cached, OUTSIDE of what happens each time the returned function is executed
    queryNameSuffix = if ProductionMode then
        ""
    else
        " Cached",

    // We can now use the pre-rendered queryNameSuffix value as a private variable that's not computed each time it's called
    returnedFunction = (queryName as text) as table => Expression.Evaluate(
        Expression.Identifier(
            queryName & queryNameSuffix
        ),
        #shared
    )
in
    returnedFunction
For each slow query ("YourQueryName") that loads to the table,
Create "YourQueryName Cached" as a query that pulls straight from results the table.
Create "modeYourQueryName" as a query defined as fModeSensitiveQuery("YourQueryName")
Change all queries that use YourQueryName to use modeYourQueryName instead.
Now you can flip ProductionMode to true and changes propagate completely, or flip ProductionMode to false and test small changes quickly; if you're refreshing just one query, it isn't recomputing the entire upstream to test it. Plus, I don't know why, but when doing a Refresh All I'm pretty sure it also speeds up the whole thing even when ProductionMode is true!
This method has three caveats that I'm aware of:
Be sure to update your "YourQueryName Cached" query any time the "YourQueryName" query's resulting columns are added, removed, renamed, or typed differently. Or better yet, delete and recreate them. You can do this because,
Power Query won't recognize your "YourQueryName" and "YourQueryName Cached" queries as dependencies of "modeYourQueryName". The Query Dependencies diagram won't be quite right, you'll be able to delete "YourQueryName" or "YourQueryName Cached" without Power Query stopping you, and renaming YourQueryName will break things instead of Power Query automatically changing all of your other queries accordingly.
While faster, the user-experience is a rougher ride, too! The UI gets a little jerky because (and I'm totally guessing, here) this technique seems to cause many more queries to finish simultaneously, flooding Excel with too many repaint requests at the same time. (This isn't a problem, really, but it sure looks like one when you aren't expecting it!)

Explicit table locking to disable DELETES?

Using Oracle 11gR2:
We already have a process that cleans up particular tables by deleting records from them that are past a specified retention date (based on the comparison between the timestamp from when the record finished processing and the retention date). I am currently writing code that will alert my team if this process fails. The only way I can see this process possibly failing is if DELETEs are disabled on the particular table it is trying to clean up.
I want to test the alerts to make sure they work and look correct by having the process fail. If I temporarily exclusively lock the table, will that disable DELETEs and cause the procedure that deletes records to fail? Or does it only disable DDL operations? Is there a better way to do this?
Assuming that "fail" means "throw an error" rather than, say, exceeding some performance bound, locking the table won't accomplish what you want. If you locked every row via a SELECT FOR UPDATE in one session, your delete job would block forever waiting for the first session to release its lock. That wouldn't throw an error and wouldn't cause the process to fail for most definitions. If your monitoring includes alerts for jobs that are running longer than expected, however, that would work well.
If your monitoring process only looks to see if the process ran and encountered an error, the easiest option would be to put a trigger on the table that throws an error when there is a delete. You could also create a child table with a foreign key constraint that would generate an error if the delete tried to delete the parent row while a child row exists. Depending on how the delete process is implemented, you probably could engineer a second process that would produce an ORA-00060 deadlock for the process you are monitoring but that is probably harder to implement than the trigger or the child table.
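One way to stage that trigger-based test is a throwaway BEFORE DELETE trigger you create before the purge job runs and drop afterwards. A sketch with the python-oracledb driver (the table name, trigger name, error text, and connection details are all placeholders):

# Temporarily block DELETEs on the cleanup target so the purge job errors
# and the alerting path can be exercised.
import oracledb  # assumes the python-oracledb driver is installed

conn = oracledb.connect(user="<user>", password="<password>", dsn="<host>/<service>")
cur = conn.cursor()

# Statement-level trigger: any DELETE against the table fails with ORA-20001.
cur.execute("""
    CREATE OR REPLACE TRIGGER block_deletes_for_alert_test
    BEFORE DELETE ON my_retention_table
    BEGIN
        RAISE_APPLICATION_ERROR(-20001,
            'Deletes temporarily disabled for alert testing');
    END;
""")

# After the alert has fired, remove the trigger again:
# cur.execute("DROP TRIGGER block_deletes_for_alert_test")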

Selecting and updating against tables in separate data sources within the same transaction

The attributes for the <jdbc:inbound-channel-adapter> component in Spring Integration include data-source, sql and update. These allow for separate SELECT and UPDATE statements to be run against tables in the specified database. Both sql statements will be part of the same transaction.
The limitation here is that both the SELECT and UPDATE will be performed against the same data source. Is there a workaround for the case when the UPDATE will be on a table in a different data source (not just a separate database on the same server)?
Our specific requirement is to select rows in a table which have a timestamp prior to a specific time. That time is stored in a table in a separate data source. (It could also be stored in a file). If both sql statements used the same database, the <jdbc:inbound-channel-adapter> would work well for us out of the box. In that case, the SELECT could use the time stored, say, in table A as part of the WHERE clause in the query run against table B. The time in table A would then be updated to the current time, and all this would be part of one transaction.
One idea I had was, within the sql and update attributes of the adapter, to use SpEL to call methods in a bean. The method defined for sql would look up a time stored in a file, and then return the full SELECT statement. The method defined for update would update the time in the same file and return an empty string. However, I don't think such an approach is failsafe, because the reading and writing of the file would not be part of the same transaction that the data source is using.
If, however, the update was guaranteed to fire only upon commit of the data-source transaction, that would work for us. In the event of a failure, the database transaction would commit, but the file would not be updated. We would then get duplicate rows, but we should be able to handle that. The issue would be if the file was updated but the database transaction failed. That would mean lost messages, which we could not handle.
If anyone has any insights as to how to approach this scenario it is greatly appreciated.
Use two different channel adapters with a pub-sub channel, or an outbound gateway followed by an outbound channel adapter.
If necessary, start the transaction(s) upstream of both; if you want true atomicity you would need to use an XA transaction manager and XA datasources. Or, you can get close by synchronizing the two transactions so they get committed very close together.
See Dave Syer's article "Distributed transactions in Spring, with and without XA" and specifically the section on Best Efforts 1PC.

Is it possible to make conditional inserts with Azure Table Storage

Is it possible to make a conditional insert with the Windows Azure Table Storage Service?
Basically, what I'd like to do is to insert a new row/entity into a partition of the Table Storage Service if and only if nothing changed in that partition since I last looked.
In case you are wondering, I have Event Sourcing in mind, but I think that the question is more general than that.
Basically I'd like to read part of, or an entire, partition and make a decision based on the content of the data. In order to ensure that nothing changed in the partition since the data was loaded, an insert should behave like normal optimistic concurrency: the insert should only succeed if nothing changed in the partition - no rows were added, updated or deleted.
Normally in a REST service, I'd expect to use ETags to control concurrency, but as far as I can tell, there's no ETag for a partition.
The best solution I can come up with is to maintain a single row/entity for each partition in the table which contains a timestamp/ETag and then make all inserts part of a batch consisting of the insert as well as a conditional update of this 'timestamp entity'. However, this sounds a little cumbersome and brittle.
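Concretely, that batch maps onto an entity-group transaction. A rough sketch with the Python azure-data-tables SDK (table name, keys, and properties are illustrative, and this only guards the partition if every writer updates the same marker entity in the same way):

from azure.core import MatchConditions
from azure.data.tables import TableClient, TableTransactionError, UpdateMode

table = TableClient.from_connection_string("<connection-string>", table_name="events")
partition = "stream-123"

# Read the per-partition marker entity; its ETag acts as the "partition version".
marker = table.get_entity(partition_key=partition, row_key="!version")

new_event = {"PartitionKey": partition, "RowKey": "event-0007", "Data": "..."}
bumped_marker = {"PartitionKey": partition, "RowKey": "!version"}

try:
    # Both operations share the partition key, so they run as one atomic
    # entity-group transaction; the update only succeeds if the marker's
    # ETag is unchanged, i.e. nobody else bumped the partition since we read it.
    table.submit_transaction([
        ("create", new_event),
        ("update", bumped_marker, {
            "mode": UpdateMode.REPLACE,
            "etag": marker.metadata["etag"],
            "match_condition": MatchConditions.IfNotModified,
        }),
    ])
except TableTransactionError:
    # The partition changed since we read it: reload, re-decide, and retry.
    pass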
Is this possible with the Azure Table Storage Service?
The view from a thousand feet
Might I share a small tale with you...
Once upon a time someone wanted to persist events for an aggregate (of Domain-Driven Design fame) in response to a given command. This person wanted to ensure that an aggregate would only be created once and that any optimistic concurrency conflict could be detected.
To tackle the first problem - that an aggregate should only be created once - he did an insert into a transactional medium that threw when a duplicate aggregate (or more accurately the primary key thereof) was detected. The thing he inserted was the aggregate identifier as the primary key, plus a unique identifier for a changeset. A collection of events produced by the aggregate while processing the command is what is meant by a changeset here. If someone or something else beat him to it, he would consider the aggregate already created and leave it at that. The changeset would be stored beforehand in a medium of his choice. The only promise this medium must make is to return what has been stored as-is when asked. Any failure to store the changeset would be considered a failure of the whole operation.
To tackle the second problem - detecting optimistic concurrency conflicts in the further life-cycle of the aggregate - he would, after having written yet another changeset, update the aggregate record in the transactional medium if and only if nobody had updated it behind his back (i.e. compared to what he last read just before executing the command). The transactional medium would notify him if such a thing happened. This would cause him to restart the whole operation, rereading the aggregate (or the changesets thereof) to make the command succeed this time.
Of course, now that he had solved the writing problems, along came the reading problems. How would one be able to read all the changesets of an aggregate that make up its history? After all, he only had the last committed changeset associated with the aggregate identifier in that transactional medium. And so he decided to embed some metadata as part of each changeset. Among the metadata - which is not so uncommon to have as part of a changeset - would be the identifier of the previous last committed changeset. This way he could "walk the line" of changesets of his aggregate, like a linked list so to speak.
As an additional perk, he would also store the command message identifier as part of the metadata of a changeset. This way, when reading changesets, he could know in advance if the command he was about to execute on the aggregate was already part of its history.
All's well that ends well ...
P.S.
1. The transactional medium and changeset storage medium can be the same,
2. The changeset identifier MUST not be the command identifier,
3. Feel free to punch holes in the tale :-),
4. Although not directly related to Azure Table Storage, I've implemented the above tale successfully using AWS DynamoDB and AWS S3.
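Since the tale mentions DynamoDB as one possible transactional medium, the two conditional writes might be sketched like this with boto3 (table and attribute names are made up for illustration):

import boto3
from botocore.exceptions import ClientError

aggregates = boto3.resource("dynamodb").Table("aggregates")  # illustrative table

def create_aggregate(aggregate_id, changeset_id):
    # First write: only succeeds if the aggregate was never created before.
    try:
        aggregates.put_item(
            Item={"AggregateId": aggregate_id, "LastChangesetId": changeset_id},
            ConditionExpression="attribute_not_exists(AggregateId)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            pass  # someone beat us to it: the aggregate already exists
        else:
            raise

def append_changeset(aggregate_id, expected_changeset_id, new_changeset_id):
    # Later writes: only succeed if nobody updated the aggregate behind our back.
    aggregates.update_item(
        Key={"AggregateId": aggregate_id},
        UpdateExpression="SET LastChangesetId = :new",
        ConditionExpression="LastChangesetId = :expected",
        ExpressionAttributeValues={
            ":new": new_changeset_id,
            ":expected": expected_changeset_id,
        },
    )
    # A ConditionalCheckFailedException here means: reread the aggregate
    # and retry the command.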
How about storing each event at a "PartitionKey/RowKey" built from AggregateId/AggregateVersion, where AggregateVersion is a sequential number based on how many events the aggregate already has?
This is very deterministic, so when adding a new event to the aggregate you will know that you were using the latest version of it, because otherwise you'll get an error saying that the row already exists for that partition. At that point you can drop the current operation and retry, or try to figure out whether you could merge the operation anyway, if the new updates to the aggregate do not conflict with the operation you just did.
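A minimal sketch of that scheme with the Python azure-data-tables SDK (table and property names are illustrative):

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

events = TableClient.from_connection_string("<connection-string>", table_name="events")

def append_event(aggregate_id, expected_version, event_payload):
    # RowKey encodes the next version; zero-padding keeps lexical ordering sane.
    entity = {
        "PartitionKey": aggregate_id,
        "RowKey": f"{expected_version + 1:010d}",
        "Payload": event_payload,
    }
    try:
        events.create_entity(entity)
    except ResourceExistsError:
        # Someone already wrote this version: reload the aggregate and
        # either retry or merge, as described above.
        raise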
