Transaction pattern for Azure Table Storage across multiple tables? - azure

Are there any software patterns that would enable a transaction across multiple tables in Azure Table Storage?
I want to write (or delete) several entities from different tables in atomic way like...
try {
    write entity to table A
    write entity to table B
} catch {
    delete entity from table A
    delete entity from table B
}
During the above transaction I also want to prevent anyone from writing/deleting the same entities (same table, partition key and row key).
I know Azure Storage does not support this directly, so I'm looking for a pattern, perhaps using an additional table to "lock" entities in the transaction until it's complete. All writers would have to obtain a lock on the entities.
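Something like the following sketch is what I have in mind, using the Azure.Data.Tables SDK (the "Locks" table, key layout, and names are just illustrative):
using Azure;
using Azure.Data.Tables;

string connectionString = "<storage connection string>";
var locks = new TableClient(connectionString, "Locks");

// One lock row per entity: PartitionKey = target table + partition, RowKey = target row.
var lockEntity = new TableEntity("TableA|pk1", "rk1") { ["Owner"] = "worker-1" };

// AddEntityAsync fails with 409 Conflict if another writer already inserted this lock row.
try
{
    await locks.AddEntityAsync(lockEntity);
}
catch (RequestFailedException ex) when (ex.Status == 409)
{
    // Another writer holds the lock; back off and retry later.
    throw;
}

try
{
    // write entity to table A, then table B, undoing table A if the write to B fails
}
finally
{
    // Release the lock so other writers can proceed.
    await locks.DeleteEntityAsync(lockEntity.PartitionKey, lockEntity.RowKey);
}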

The only way to ensure that no one else modifies rows in a table while you are working on them is to add the overhead of blob leasing. You can have one instance/thread grab the blob lease, do whatever it needs to, and then release the lease when done. If it fails to grab the lease, it has to either wait or try again later.
The other table-based mechanisms, such as optimistic concurrency via ETags, will not actually prevent someone else from modifying the records.
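A minimal sketch of the blob-lease approach, assuming the Azure.Storage.Blobs SDK (container and blob names are illustrative):
using System;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

string connectionString = "<storage connection string>";
var lockBlob = new BlobClient(connectionString, "locks", "entities-ab.lock");

// Create the lock blob once; the lease on it acts as the cross-table lock.
if (!(await lockBlob.ExistsAsync()).Value)
    await lockBlob.UploadAsync(BinaryData.FromString("lock"));

BlobLeaseClient lease = lockBlob.GetBlobLeaseClient();
await lease.AcquireAsync(TimeSpan.FromSeconds(30)); // throws if someone else holds the lease

try
{
    // write entity to table A, then table B; undo manually if either write fails
}
finally
{
    await lease.ReleaseAsync(); // let other writers acquire the lease
}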

Related

Can you make a Azure Data Factory data flow for updating data using a foreign key?

I've tried this a few ways and seem to be blocked.
This is nothing more than a daily ETL process. What I'm trying to do is to use ADF and pull in a csv as one of my datasets. With that data I need to update docs in a CosmosDb container, which is the other dataset in this flow. My data is really simple.
ForeignId string
Value1 int
Value2 int
Value3 int
The Cosmos docs all have these data items and more. ForeignId is unique in the container and is the partition key. The docs are a composite dataset that actually have 3 other id fields that would be considered the PK in the system of origin.
When you try and use a data flow UPDATE with this data the validation complains that you have to map "Id" to use UPDATE. I have an Id in my document, but it only relates to my collection, not to old, external systems. I have no choice but to use the ForeignId. I have it flowing using UPSERT but, even though I have the ForeignId mapped between the datasets, I get inserts instead of updates.
Is there something I'm missing, or is ADF not set up to sync data based on anything other than a data item named "id"? Is there another option in ADF aside from the straightforward approach? I've read that you can drop updates into Lookup tasks, but that seems like a hack.
The document id is needed by Cosmos DB to know which document to update; it has nothing to do with ADF.
To make this work in ADF, add an Exists transformation in your data flow to see if the row already exists in your collection. Check using the foreign key column in your incoming source data against the existing collection.
If a row is found with that foreign key, then you can add the corresponding id to your metadata, allowing you to include it in your sink.

Azure Table Storage data modeling considerations

I have a list of users. A user can log in using either a username or an e-mail address.
As a beginner in Azure Table Storage, this is what I do for the data model to get fast index scans.
PartitionKey     RowKey           Property
users:email      jacky#email.com  nickname:jack123
users:username   jack123          email:jacky#email.com
So when a user logs in via email, I would supply PartitionKey eq users:email in the Azure Table query. If it is a username, PartitionKey eq users:username.
Since it doesn't seem possible to simulate contains or like in an Azure Table query, I'm wondering if it is normal practice to store multiple rows of data for one user?
This is a perfectly valid practice and in fact is a recommended practice. Essentially you will have to identify the attributes on which you could potentially query your table storage and somehow use them as a combination of PartitionKey and RowKey.
Please see Guidelines for table design for more information. From this link:
Consider storing duplicate copies of entities. Table storage is cheap so consider storing the same entity multiple times (with
different keys) to enable more efficient queries.
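As a rough sketch of the duplicate-entity approach with the Azure.Data.Tables SDK (values adapted from the question; '@' is used because '#' is not a valid RowKey character):
using Azure.Data.Tables;

string connectionString = "<storage connection string>";
var table = new TableClient(connectionString, "users");
await table.CreateIfNotExistsAsync();

// The same user is stored twice, once per lookup key.
await table.UpsertEntityAsync(new TableEntity("users:email", "jacky@email.com") { ["nickname"] = "jack123" });
await table.UpsertEntityAsync(new TableEntity("users:username", "jack123") { ["email"] = "jacky@email.com" });

// Login via username then becomes a cheap point read on PartitionKey + RowKey.
var byUsername = await table.GetEntityAsync<TableEntity>("users:username", "jack123");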

Synchronize data lake with the deleted record

I am building data lake to integrate multiple data sources for advanced analytics.
In the beginning, I selected HDFS as the data lake storage. But I have a requirement for updates and deletes in the data sources, which I have to synchronize with the data lake.
To respect the immutable nature of the data lake, I will use LastModifiedDate from the data source to detect that a record was updated, and insert that record into the data lake with the current date. The idea is to then select the record with max(date).
However, I am not able to understand how I will detect deleted records in the sources, and how I should reflect them in the data lake.
Should I use another data store like Cassandra and execute a delete command? I am afraid that would lose the immutability property.
Can you please suggest a good practice for this situation?
1. Question - Detecting deleted records from data sources
Detecting deleted records from data sources requires that your data sources support this. Ideally, deletion is only done logically, e.g. with a change flag. For some databases it is also possible to track deleted rows (see, for example, SQL Server). Some ETL solutions like Informatica also offer CDC (Change Data Capture) capabilities.
2. Question - Handling changed data in a big data solution
There are different approaches. Of course you can use a key-value store, which adds some complexity to the overall solution. First you have to clarify whether it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally building a current image (the data as it is in your data source). Also consider solutions like Databricks Delta, which address these topics without the need for an additional store. For example, with Delta you are able to do an upsert on Parquet files as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (date, eventId, data)
If your solution also requires low-latency access via a key (e.g. to support an API), then a key-value store like HBase, Cassandra, etc. would be helpful.
This is a common constraint when creating a data lake in Hadoop: one can't just update or delete records in it. One approach that you can try is:
When you are adding lastModifiedDate, you can also add one more column named status. If a record is deleted, mark the status as Deleted. The next time you want to query the latest active records, you will be able to filter such records out.
You can also use Cassandra or HBase (any NoSQL database) if you are performing ACID operations on a daily basis. If not, the first approach would be your ideal choice for creating a data lake in Hadoop.

Azure Storage - Handle cross partition updates

I have a question about a best-practice when working with the Azure Table service.
Imagine a table called Customers. Imagine several other tables, split into a vast amount of partitions. In these tables, there are CustomerName fields.
In the case that a customer changes his name, I then update the corresponding record in the Customers table. In contrast to a relational database, the CustomerName copies in the other tables are (obviously) not updated automatically.
What is the best way to make sure that all the other tables are also updated? It seems extremely inefficient to me to query all tables on the CustomerName, and subsequently update all these records.
If you are storing the CustomerName multiple times across tables there is no magic about it, you will need to find those records and update the CustomerName field on them as well.
Since it is quite an inefficient operation, you can (and should) do this "off-transaction". Meaning, when you perform your initial "Name Change" operation, push an item onto a queue and have a worker perform the "Name Change". Since there is no web response / user waiting anxiously for the worker to complete, the fact that it is ridiculously inefficient is inconsequential.
This is a primary design pattern for implementing eventual consistency within distributed systems.
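A minimal sketch of that pattern, assuming Azure Storage queues via the Azure.Storage.Queues SDK (queue name and message format are illustrative):
using Azure.Storage.Queues;

string connectionString = "<storage connection string>";
var queue = new QueueClient(connectionString, "name-changes");
await queue.CreateIfNotExistsAsync();

// 1. In the front-end operation: update the Customers table, then enqueue the fan-out work.
string customerId = "customer-123";
string newName = "New Name";
await queue.SendMessageAsync($"{customerId}|{newName}");

// 2. In a background worker: drain the queue and update the denormalized copies.
foreach (var msg in (await queue.ReceiveMessagesAsync(maxMessages: 16)).Value)
{
    var parts = msg.MessageText.Split('|');
    // Look up every table/partition that stores CustomerName for parts[0],
    // update each copy to parts[1], then remove the message.
    await queue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
}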

TransactionScope and Azure Table Storage

Is there an equivalent to TransactionScope that you can use with Azure Table Storage?
What I'm trying to do is the following:
using (TransactionScope scope = new TransactionScope())
{
    account.balance -= 10;
    purchaseOrders.Add(order);
    accountDataSource.SaveChanges();
    purchaseOrdersDataSource.SaveChanges();
    scope.Complete();
}
If for some reason saving the account works, but saving the purchase order fails, I don't want the account to decrement the balance.
Within a single table and single partition, you may write multiple rows in an entity group transaction. There's no built-in transaction mechanism when crossing partitions or tables.
That said, remember that tables are schema-less, so if you really needed a transaction, you could store both your account row and your purchase order row in the same table and the same partition, and do a single (transactional) save.
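For example, a rough sketch of such a single-partition entity group transaction with the Azure.Data.Tables SDK (table name, keys, and properties are illustrative):
using Azure.Data.Tables;

string connectionString = "<storage connection string>";
var table = new TableClient(connectionString, "Accounting");
await table.CreateIfNotExistsAsync();

// Both rows live in the same table and the same partition (the account id).
var account = new TableEntity("account-42", "account") { ["Balance"] = 90 };
var order = new TableEntity("account-42", "order-0001") { ["Amount"] = 10 };

// The batch commits or fails as a whole because every action shares one PartitionKey.
await table.SubmitTransactionAsync(new[]
{
    new TableTransactionAction(TableTransactionActionType.UpsertReplace, account),
    new TableTransactionAction(TableTransactionActionType.Add, order)
});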
