Can you make an Azure Data Factory data flow for updating data using a foreign key?

I've tried this a few ways and seem to be blocked.
This is nothing more than a daily ETL process. What I'm trying to do is use ADF to pull in a csv as one of my datasets. With that data I need to update docs in a Cosmos DB container, which is the other dataset in this flow. My data is really simple:
ForeignId string
Value1 int
Value2 int
Value3 int
The Cosmos docs all have these data items and more. ForeignId is unique in the container and is the partition key. The docs are a composite dataset that actually have 3 other id fields that would be considered the PK in the system of origin.
When you try to use a data flow UPDATE with this data, the validation complains that you have to map "Id" to use UPDATE. I have an Id in my document, but it only relates to my collection, not to the old, external systems. I have no choice but to use the ForeignId. I have it flowing using UPSERT but, even though I have the ForeignId mapped between the datasets, I get inserts instead of updates.
Is there something I'm missing, or is ADF not set up to sync data based on anything other than a data item named "id"? Is there another option in ADF aside from the straightforward approach? I've read that you can drop updates into the Lookup tasks, but that seems like a hack.

The row ID is needed by CosmosDB to know which row to update. It has nothing to do with ADF.
To make this work in ADF, add an Exists transformation in your data flow to see if the row already exists in your collection. Check using the foreign key column in your incoming source data against the existing collection.
If a row is found with that foreign key, then you can add the corresponding id to your metadata, allowing you to include it in your sink.
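Outside of ADF, the same "look up the document's id by its foreign key, then write" logic looks roughly like the sketch below, using the azure-cosmos Python SDK. The account, key, database, and container values are placeholders, not anything from the question.

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint, key, and names -- replace with your own.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

def update_from_csv_row(row: dict) -> None:
    """Look up the existing doc by ForeignId (the partition key), then update it."""
    matches = list(container.query_items(
        query="SELECT * FROM c WHERE c.ForeignId = @fk",
        parameters=[{"name": "@fk", "value": row["ForeignId"]}],
        partition_key=row["ForeignId"],   # ForeignId is the partition key
    ))
    if not matches:
        return  # no existing doc; decide separately whether to insert

    doc = matches[0]
    doc.update({k: row[k] for k in ("Value1", "Value2", "Value3")})
    # The write still needs the document's own id -- the same reason ADF insists on it.
    container.replace_item(item=doc["id"], body=doc)
```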

Related

Delta Logic implementation using SnapLogic

Is there any snap available in SnapLogic to do the following:
1. Connect with Snowflake and get data by SELECT * FROM VIEW
2. Connect with Azure Blob Storage and get the data from the csv file: FILENAME_YYYYMMDD.csv
3. Take only those records which are available in 1 but NOT available in 2, and write this delta back to Azure Blob Storage: FILENAME_YYYYMMDD.csv
Is In-Memory Look-Up useful for this?
No, In-Memory Lookup snap is used for cases where you need to look up the value corresponding to the value in a certain field of the incoming records. For example, say you want to look up a country name against the country ISO code. This snap generally fetches the lookup table once and stores it in memory. Then it uses this stored lookup table to provide data corresponding to the incoming records.
In your case, you have to use the Join snap and configure it to an inner join.
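For what it's worth, outside SnapLogic the "available in 1 but NOT available in 2" delta corresponds to an anti-join on the key column. A minimal pandas sketch, with ID as an assumed key column name:

```python
import pandas as pd

# Hypothetical inputs: view_df pulled from the Snowflake view,
# blob_df read from FILENAME_YYYYMMDD.csv in Azure Blob Storage.
view_df = pd.DataFrame({"ID": [1, 2, 3], "VALUE": ["a", "b", "c"]})
blob_df = pd.DataFrame({"ID": [2, 3], "VALUE": ["b", "c"]})

# Keep only the rows whose key is in the view but not in the blob file.
delta = view_df[~view_df["ID"].isin(blob_df["ID"])]
delta.to_csv("FILENAME_YYYYMMDD.csv", index=False)
```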

Azure Cosmos DB as a key-value store: indexing mode

What indexing mode / policy should I use when using Cosmos DB as a simple key/value store?
From https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy :
None: Indexing is disabled on the container. This is commonly used when a container is used as a pure key-value store without the need for secondary indexes.
Is this because the property used as partition key is indexed even when indexMode is set to “none”? I would expect to need to turn indexing on but specify just the partition key’s path as the only included path.
If it matters, I’m planning to use the SQL API.
EDIT:
Here's the information I was missing to understand this:
The item must have an id property, otherwise Cosmos DB will assign one. https://learn.microsoft.com/en-us/azure/cosmos-db/account-databases-containers-items#properties-of-an-item
Since I'm using Azure Data Factory to load the items, I can tell ADF to duplicate the column that has the value I want to use as my id into a new column called id: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#add-additional-columns-during-copy
I need to use ReadItemAsync, or better yet, ReadItemStreamAsync since it doesn't deserialize the response, to get the item without using a query.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.cosmos.container.readitemasync?view=azure-dotnet
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.cosmos.container.readitemstreamasync?view=azure-dotnet
When you set indexingMode to "none", the only way to efficiently retrieve a document is by id (e.g. ReadDocumentAsync() or read_item()). This is akin to a key/value store, since you wouldn't be performing queries against other properties; you'd be specifically looking up a document by some known id, and returning the entire document. Cost-wise, this would be ~1RU for a 1K document, just like point-reads with an indexed collection.
You could still run queries, but without indexes, you'll see unusually-high RU cost.
You would still specify the partition key's value with your point-reads, as you'd normally do.
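For illustration, a point read in the Python SDK (the counterpart of the ReadItemAsync / read_item calls mentioned above) looks roughly like this; the account, database, container, id, and partition key values below are placeholders:

```python
from azure.cosmos import CosmosClient

# Placeholder account, database, and container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# Point read: no query engine involved, works with indexing disabled,
# and costs roughly 1 RU for a 1 KB item.
item = container.read_item(item="<known-id>", partition_key="<partition-key-value>")
print(item)
```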

Azure Table Storage data modeling considerations

I have a list of users. A user can log in using either a username or an e-mail address.
As a beginner in Azure Table storage, this is what I do for the data model for fast index scans.
PartitionKey      RowKey             Property
users:email       jacky#email.com    nickname:jack123
users:username    jack123            email:jacky#email.com
So when a user logs in via email, I would supply PartitionKey eq users:email in the Azure Table query. If it is a username, PartitionKey eq users:username.
Since it doesn't seem possible to simulate contains or like in an Azure Table query, I'm wondering if it is normal practice to store multiple rows of data for one user?
Since it doesn't seem possible to simulate contains or like in an Azure Table query, I'm wondering if it is normal practice to store multiple rows of data for one user?
This is a perfectly valid practice and in fact is a recommended practice. Essentially you will have to identify the attributes on which you could potentially query your table storage and somehow use them as a combination of PartitionKey and RowKey.
Please see Guidelines for table design for more information. From this link:
Consider storing duplicate copies of entities. Table storage is cheap so consider storing the same entity multiple times (with different keys) to enable more efficient queries.
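As a rough sketch of the duplicated-entity pattern in code, using the azure-data-tables Python package (the connection string, table name, and property names are assumptions, not from the question):

```python
from azure.data.tables import TableClient

# Placeholder connection string and table name.
table = TableClient.from_connection_string("<connection-string>", table_name="users")

def register_user(email: str, username: str) -> None:
    # Two copies of the same logical user, one per lookup path.
    table.create_entity({"PartitionKey": "users:email", "RowKey": email,
                         "nickname": username})
    table.create_entity({"PartitionKey": "users:username", "RowKey": username,
                         "email": email})

def find_by_email(email: str) -> dict:
    # PartitionKey + RowKey identify exactly one entity: a point query.
    return table.get_entity(partition_key="users:email", row_key=email)
```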

Get all the Partition Keys in Azure Cosmos DB collection

I have recently started using Azure Cosmos DB in our project. For the reporting purpose, we need to get all the Partition Keys in the collection. I could not find any suitable API to achieve it.
UPDATE: According to Brian in the comments below, DISTINCT is now supported. Try something like:
SELECT DISTINCT c.partitionKey FROM c
Prior answer: Idea that could work but for one thing...
The only way to get the actual partition key values is to do a unique aggregate on that field.
You can directly hit the REST endpoint at https://{your endpoint domain}.documents.azure.com/dbs/{your collection's uri fragment}/pkranges to pull back the minInclusive and maxExclusive ranges for each partition, but those are hash-space ranges, and I don't know how to convert them into partition key values nor do a fan-out using the actual minInclusive hash.
Also, there is a slim possibility that the pkranges can change between the time you retrieve them and the time you go to do something with them.
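If it helps, here is a minimal sketch of running that DISTINCT query cross-partition with the azure-cosmos Python SDK; the account, database, and container names are placeholders, and c.partitionKey stands for whatever property your container's partition key path points to:

```python
from azure.cosmos import CosmosClient

# Placeholder account, database, and container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# DISTINCT over the partition key property; this has to fan out across all partitions.
results = container.query_items(
    query="SELECT DISTINCT c.partitionKey FROM c",
    enable_cross_partition_query=True,
)
for row in results:
    print(row["partitionKey"])
```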

Regarding Azure table design

I am working as a freelancer and am currently working on one of my games, trying to use the Azure Table service to log user moves in Azure tables.
The game is based on Cards.
The flow is like this:
Many users (UserId) will be playing on a table (TableId). Each game on the table will have a unique GameId. In each game there could be multiple deals with a unique DealId.
There can be multiple deals on the same table with the same GameId. Also, each user will have the same DealId in a single game.
Winner is decided after multiple chances of a player.
Problem:
I can make TableId the PartitionKey, but I am not sure what to choose for the RowKey because the combination of TableId and RowKey (GameId/UserId/DealId) should be unique in the table.
I can have entries like:
TableId  GameId  DealId  UserId  Timestamp
1        201     300     12345
1        201     300     12567
Maybe what I can do is create 4 Azure tables like below, but then I am doing a lot of duplication; also I would not be able to fire a point query as mentioned at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
The format of TableId, GameId, DealId and UserId is long.
I would like to query data such that:
1. Get me all the logs from a TableId.
2. Get me all the logs from a TableId in a particular game (GameId).
3. Get me all the logs of a user (UserId) in this game (GameId).
4. Get me all the logs of a user in a deal (DealId).
5. Get me all the logs from a table on a date; similarly for a user, game, and deal.
Based on my knowledge of Azure Tables so far, I believe you're on the right track.
However there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need to use separate tables for storing each kind of data though this approach logically separates the data nicely. If you want, you could possibly store them in a single table. If you go with single table, since these ids (Game, Table, User, and Deal) are numbers what I would recommend is to prefix the value appropriately so that you can nicely identify them. For example, when specifying PartitionKey denoting a Game Id, you can prefix the value with G| so that you know it's the Game Id e.g. G|101.
Pre-pad your Id values with 0 to make them equal-length strings
You mentioned that your id values are long. However, the PartitionKey value is of string type. I would recommend pre-padding the values so that they are of equal length. For example, when storing a Game Id as the PartitionKey, instead of storing 1, 2, 103, etc., store them as 00000000001, 00000000002, 00000000103. This way, when you list all Ids, they will be sorted in proper order. Without pre-padding, you would get results sorted as 1, 10, 11, 12, ..., 19, 2, 20, ...
You will lose transaction support
Since you're using multiple tables (or even a single table with different PartitionKeys), you will not be able to use Entity Batch Transactions available in Azure Tables, and each insert will need to be done as its own operation. Since each operation is a network call and can possibly fail, you may want to do the inserts through an idempotent background process which keeps trying to insert the data into the multiple tables until it succeeds.
Instead of Guid for RowKey, I suggest you create a composite RowKey based on other values
This is more applicable for the update scenario. Since an update requires both the PartitionKey and the RowKey, I would recommend using a RowKey which is created as a composition of other values. For example, if you're using TableId as the PartitionKey for GameLogsByTableId, I would suggest creating a RowKey from the other values, e.g. U|[UserId]|D|[DealId]|G|[GameId]. This way, when you get a record to update, you automatically know how to construct the RowKey instead of first fetching the data from the table.
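To make the pre-padding and composite-RowKey suggestions concrete, here is a small Python sketch; the padding width and the U|/D|/G| delimiters are arbitrary illustrative choices, not anything Azure requires:

```python
def pad(value: int, width: int = 11) -> str:
    """Zero-pad a numeric id so lexicographic order matches numeric order."""
    return str(value).zfill(width)

def row_key(user_id: int, deal_id: int, game_id: int) -> str:
    """Composite RowKey that can be rebuilt from the ids, so updates need no lookup."""
    return f"U|{pad(user_id)}|D|{pad(deal_id)}|G|{pad(game_id)}"

# Example entity for the GameLogsByTableId table:
entity = {
    "PartitionKey": pad(1),              # "00000000001" (TableId)
    "RowKey": row_key(12345, 300, 201),  # "U|00000012345|D|00000000300|G|00000000201"
}
```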
Partition Scans
I looked at your querying requirements and almost all of them would result in Partition Scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements: in those cases, you will need to scan the entire partition for a user to find information about a Game Id and Deal Id. So please be prepared for the scenario where the table service returns nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use table storage for this. It will make your job much harder than using a SQL database, where you can use any index you like, have relational integrity, and so much more. The only thing in favour of ATS is that it's cheap for large data.
