I've recently started working with Azure Cosmos DB and Functions. While reading the documentation at https://learn.microsoft.com/pl-pl/azure/cosmos-db/change-feed-processor I found something that is quite hard for me to understand. Is it actually possible to share a change feed between many functions so they will be triggered by one and the same DB operation? What is the lease collection and what problem does it solve? What is the purpose of a lease? I'd like a basic explanation of these terms. The link I provided says it is possible to share a lease between two functions, but then it says that a lease object has an owner property.
Yes, you can have multiple functions being triggered by the same change. However, this requires you to have separate leases for them. They can live in the same lease collection, but they need a different prefix. There is a setting for that; in Azure Functions it's the leaseCollectionPrefix property on the trigger attribute.
Leases are really just documents like any other in Cosmos DB. They are used to keep track of the consumers of the change feed processor and to save checkpoints, so the consumers know where to continue if your app restarts.
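As a rough illustration, here is what the two trigger bindings might look like, written out as Python dicts that mirror the function.json cosmosDBTrigger properties (extension 3.x naming); the database, collection and connection names are placeholders:

```python
# Sketch: two functions listening to the same monitored collection, sharing
# one lease collection but using different prefixes, so each function keeps
# its own lease/checkpoint documents and receives every change independently.
function_a_trigger = {
    "type": "cosmosDBTrigger",
    "name": "documents",
    "connectionStringSetting": "CosmosDbConnection",   # placeholder app setting
    "databaseName": "mydb",
    "collectionName": "orders",
    "leaseCollectionName": "leases",
    "leaseCollectionPrefix": "functionA",
    "createLeaseCollectionIfNotExists": True,
}

function_b_trigger = {
    **function_a_trigger,
    "leaseCollectionPrefix": "functionB",   # only the prefix differs
}
```

Because each prefix gives a function its own set of lease documents (each with its own owner and continuation token), both functions see every change without stepping on each other's checkpoints.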
I have a v1 Azure Function that is triggered by a message being written to the Azure Storage message queue.
The Azure Function needs to perform multiple updates to SharePoint Online. Occasionally these operations fail. This results in the message being returned to the queue and being reprocessed.
When I developed the function, I didn't consider that it might partially complete and then restart. I've done a little research and it sounds like I need to modify it to be re-entrant.
Is there a design pattern that I should follow to cater for this without having to add a lot of checks to determine if an operation has already been carried out by a previous execution? Alternatively, is there any Azure functionality that can help (beyond the existing message retries and poison queue)?
It sounds like you will need to do some re-engineering. Our team had a similar issue and wrote a home-grown solution years ago. But we eventually scrapped our solution and went with Azure Durable Functions.
Not gonna lie - this framework has some complexity and it took me a bit to wrap my head around it. Check out the function chaining pattern.
We have processing that requires multiple steps that all must be completed. We're spanning multiple data stores (updating Cosmos DB, Azure SQL, Blob Storage, etc.), so there's no support for distributed transactions across the different PaaS offerings. Durable Functions will allow you to break your process up into discrete steps. If a step fails, the orchestrator will re-run that step based on a retry policy.
So in a nutshell, we use Durable Task Activity functions to attempt each step. If the step fails due to what we think is a transient error, we retry. If it's an unrecoverable error, we don't retry.
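As a rough sketch of the chaining pattern with per-step retries (Python, azure-functions-durable library; the activity names are made up for illustration):

```python
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    item = context.get_input()

    # Retry transient failures a few times before letting the step fail for good.
    retry = df.RetryOptions(first_retry_interval_in_milliseconds=5000,
                            max_number_of_attempts=3)

    # Each SharePoint update is its own activity; if one fails, only that
    # step is retried, not the work that already succeeded.
    yield context.call_activity_with_retry("UpdateListItem", retry, item)
    yield context.call_activity_with_retry("UpdatePermissions", retry, item)
    yield context.call_activity_with_retry("NotifyCompletion", retry, item)

main = df.Orchestrator.create(orchestrator_function)
```

Steps you consider unrecoverable can be called with a plain context.call_activity (no retry options), or the orchestrator can catch the failure and stop.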
The documentation states that Azure Durable Functions orchestration code should be deterministic, because of replays. In my case, I have some data in Azure Table Storage that I need to fetch in the workflow. The workflow is recursive, the data in Azure Table Storage can change during execution, and it is OK to have stale state for ~1 minute. In regular code I would rely on an in-memory cache to improve performance. But in orchestrations, I assume it cannot be used directly, because that makes the workflow non-deterministic.
I can still use the cache in an activity and call it from orchestrations, but every activity call involves serialization/deserialization of inputs/outputs and passing messages through the control queue. These operations are heavier than fetching the data itself.
So my question is: is there any pattern that can be used to cache data in memory between orchestration instances, without wrapping this logic in an activity?
What I can suggest is to use a distributed cache, specifically Azure Cache for Redis.
Get your data from Azure Table Storage in your orchestration, do your operation there, and save the result to the Redis cache. Then pass the ID of the required data to each activity, and each activity can read the data back from the Redis cache.
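For example, the activity side might look roughly like this (Python with the redis package; host/key settings are placeholders, and it assumes the data was written earlier with something like cache.set(key, json.dumps(data), ex=60)):

```python
import json
import os

import redis  # pip install redis

# Azure Cache for Redis uses TLS on port 6380; connection details are placeholders.
cache = redis.Redis(
    host=os.environ["REDIS_HOST"],
    port=6380,
    password=os.environ["REDIS_KEY"],
    ssl=True,
)

def main(cacheKey: str) -> dict:
    """Activity: receive only the cache key from the orchestrator and look up
    the (possibly ~1 minute stale) payload in Redis, instead of taking the
    whole payload as activity input."""
    raw = cache.get(cacheKey)
    return json.loads(raw) if raw else None
```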
This is a solution with a cache, as you asked. However, please note that if you want high-performance data queries, Azure Table Storage is not the best option; I suggest using either Azure SQL or Cosmos DB. If you are looking for a cheap option, Table Storage is fine, but in that case Redis cache won't be a good fit either, because it isn't cheap. If Redis cache doesn't work for you, I would suggest reviewing your algorithm.
Good luck!
You can store data between orchestrations with entity functions.
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-entities
They can process up to 64 operations per second:
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#performance-targets
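A minimal sketch of an entity used as a small shared cache (Python, azure-functions-durable; the entity name, operations and state shape are illustrative only):

```python
import azure.durable_functions as df

def entity_function(context: df.DurableEntityContext):
    # Durable state survives across orchestration instances without an
    # external store; the operations here are just "set" and "get".
    operation = context.operation_name

    if operation == "set":
        context.set_state(context.get_input())
    elif operation == "get":
        context.set_result(context.get_state(lambda: None))

main = df.Entity.create(entity_function)

# From an orchestrator, the cached value could then be read with something like:
#   cached = yield context.call_entity(df.EntityId("CacheEntity", "table-data"), "get")
```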
I'd like to know if Azure Search offers any ability to trigger an Azure Function when a document gets indexed or inserted into Azure Search, or if there are any other events I can take advantage of.
I'd like to avoid a timed event which continuously scans Azure search for new documents.
If you're using an indexer, you can add a skillset with a WebApiSkill to invoke your Azure Function for each inserted document. However, there are no transactional consistency guarantees - a document for which your function is invoked is not guaranteed to be successfully inserted into the index.
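A rough sketch of such a skillset definition, written as a Python dict in the shape the Azure Cognitive Search REST API expects (the skillset name, function URL and field names are placeholders):

```python
# Custom Web API skill that POSTs part of each enriched document to a function.
skillset = {
    "name": "notify-skillset",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
            "description": "Call an Azure Function for every document the indexer processes",
            "uri": "https://myfuncapp.azurewebsites.net/api/on-indexed?code=<function-key>",
            "httpMethod": "POST",
            "context": "/document",
            "inputs": [
                {"name": "id", "source": "/document/id"},
            ],
            "outputs": [
                {"name": "status", "targetName": "notifyStatus"},
            ],
        }
    ],
}
```

Keep in mind the function has to accept and return the custom skill request/response format (a values array with a recordId and data per record), and, as noted above, being called by the skillset doesn't guarantee the document ultimately lands in the index.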
Unfortunately, there isn't a great way to do this today. Eugene's suggestion will work, but it isn't super efficient and does indeed have the limitation that the document might not actually make it to the index if something else goes wrong later in the indexer. Please vote on the following UserVoice item related to triggered events for Azure Cognitive Search if you are interested in seeing a more well-defined option for this scenario: https://feedback.azure.com/forums/263029-azure-search/suggestions/10095111-azure-search-alerts
There must be a solution to this already, but I'm having trouble finding it.
We have data stored in Table Storage and we are syncing it with an offline-capable client web app over a RESTful API (Web API).
We are using a high watermark (currently a datetime) to make sure we only download the data that has changed or been added.
e.g. clients/get?watermark=2013-12-16 10:00
The problem we are facing with this approach is what happens in the edge case where multiple servers are inserting data whilst a get happens. There is a possibility that data could be inserted with a timestamp lower than the client's timestamp.
Should we worry about this or can someone recommend a better way of doing this?
I believe our main issue is inserting the data into the store. At that point there is no way to guarantee the timestamp used, or that the Azure box has the correct time relative to the other Azure boxes.
Are you able to insert data into queues when inserting data into Table Storage? If so, you can build a sync process that monitors the queue and inserts data based upon what's in the queue. This will let you stop worrying about timestamps and clock-sync issues.
It will also make your Table Storage scanning faster, as you'll be able to go directly to Table Storage by the partition/row keys that would presumably be in the queue messages.
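For instance, the write path could pair the table write with a queue message (sketched with the current azure-data-tables and azure-storage-queue Python SDKs, which postdate the original question; table, queue and setting names are placeholders):

```python
import json
import os

from azure.data.tables import TableClient
from azure.storage.queue import QueueClient

conn = os.environ["STORAGE_CONNECTION_STRING"]
table = TableClient.from_connection_string(conn, table_name="Clients")
queue = QueueClient.from_connection_string(conn, queue_name="client-changes")

def save_client(entity: dict) -> None:
    """Write the entity, then announce the change on a queue so the sync
    process can pick it up without scanning the table by timestamp."""
    table.upsert_entity(entity)
    queue.send_message(json.dumps({
        "PartitionKey": entity["PartitionKey"],
        "RowKey": entity["RowKey"],
    }))
```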
Edited to provide further information:
I re-read your question and realized you're looking to sync with many client applications and not necessarily with a single on-premises sync system, which is what I assumed originally.
In this case, I'm slightly tweaking my suggestion:
Consider using Service Bus and publishing a message to a Service Bus Topic every time you change/insert an Azure Table Storage (ATS) entity. This message could contain an individual PartitionKey/RowKey or perhaps some other meta-information about which ATS entities have been changed.
Your individual, occasionally-disconnected clients would subscribe to the Service Bus Topic through their own Service Bus Topic Subscriptions, pull and handle individual Service Bus messages, and sync whatever ATS entities are described in those messages.
This way you won't really care about the last-modified timestamps of your entities and will only care about pulling messages from the topic. If a client pulls all of the messages from its subscription and synchronizes all of the entities that those messages describe, it has synchronized itself, regardless of the number of workers inserting data into ATS and the timestamps with which they insert those entities.
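The client side of that might look roughly like this (Python, azure-servicebus SDK; topic, subscription and helper names are placeholders):

```python
import json
import os

from azure.servicebus import ServiceBusClient

conn = os.environ["SERVICE_BUS_CONNECTION_STRING"]
client = ServiceBusClient.from_connection_string(conn)

# Each client app has its own subscription on the topic, so every client
# sees every change message independently.
with client.get_subscription_receiver(topic_name="ats-changes",
                                      subscription_name="client-42") as receiver:
    for message in receiver.receive_messages(max_message_count=50, max_wait_time=5):
        change = json.loads(str(message))  # assumes the publisher sent a JSON body
        # Hypothetical app code: fetch the ATS entity by PartitionKey/RowKey
        # and apply it to the local offline store, e.g.:
        #   entity = fetch_entity(change["PartitionKey"], change["RowKey"])
        #   apply_locally(entity)
        receiver.complete_message(message)
```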
When you're working in a disconnected/distributed environment, it is hard to keep things in sync based on actual time (for this to work correctly, the clocks need to be in sync between all actors).
Instead you should try looking at logical clocks (like a vector clock). You'll find plenty of Java examples, but if you're planning to do this in .NET the examples are pretty limited.
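The idea itself is small; a minimal vector clock sketch (shown in Python only to keep it language-neutral):

```python
# Each node keeps a counter per node id: bump your own counter on a local
# update, take the element-wise max when merging with a peer's clock, and
# compare clocks to detect ordering vs. concurrent updates.

def increment(clock: dict, node_id: str) -> dict:
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated

def merge(a: dict, b: dict) -> dict:
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def happened_before(a: dict, b: dict) -> bool:
    """True if clock a is strictly older than clock b; if neither happened
    before the other, the updates were concurrent and need conflict resolution."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes) and a != b
```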
On the other hand you might want to take a look at how the Sync Framework handles synchronization.
I need to create incremental reports in the table storage. I need to be able to update the same records from several different worker role instances (different roles with several instances each).
My reports consist mainly of values that I need to increment after I parse the raw data I initially stored.
The optimistic-concurrency solution I found is to use a retry mechanism: try to update the record, and if you get a 412 result code (you don't have the latest ETag value), retry. This solution becomes less efficient and more costly the more users you have and the more data you need to update simultaneously (my case exactly).
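For reference, the retry loop looks roughly like this (sketched with the current azure-data-tables Python SDK; the table, keys and field name are placeholders):

```python
import os

from azure.core import MatchConditions
from azure.core.exceptions import HttpResponseError
from azure.data.tables import TableClient, UpdateMode

table = TableClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"], table_name="Reports")

def increment(partition_key: str, row_key: str, field: str, delta: int) -> None:
    """Read-modify-write with optimistic concurrency; retry while the ETag is stale."""
    while True:
        entity = table.get_entity(partition_key, row_key)
        entity[field] = entity.get(field, 0) + delta
        try:
            table.update_entity(
                entity,
                mode=UpdateMode.MERGE,
                etag=entity.metadata["etag"],
                match_condition=MatchConditions.IfNotModified,
            )
            return
        except HttpResponseError as err:
            if err.status_code != 412:  # only retry the concurrency conflict
                raise
```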
Another solution that comes to mind is to have only one instance of one worker role that can possibly update any given record. This is very problematic because it means that I will, by design, create bottlenecks in my architecture, which is the opposite of the scale I want to reach with Azure.
If anyone here has some best practices in mind for such a use case, I would love to hear it.
Most cloud storage services (Table Storage is one of them) do not offer scalable writes on a single entity/blob/whatever. There is no quick fix for this limitation, as it comes from the core tradeoffs that were made to build cloud storage in the first place.
Basically, a single storage unit (entity/blob/whatever) can be updated about once every 20 ms, and that's about it. Having a dedicated worker or not will not change anything about this.
Instead, you need to approach your task from a different angle. For counters, the most usual approach is the use of sharded counters (the link is for GAE, but you can implement equivalent behavior on Azure).
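A sharded counter on Table Storage could look roughly like the following (Python, current azure-data-tables SDK; the shard count and names are arbitrary, and the shard rows are assumed to be pre-created with value = 0):

```python
import os
import random

from azure.core import MatchConditions
from azure.core.exceptions import HttpResponseError
from azure.data.tables import TableClient, UpdateMode

NUM_SHARDS = 16  # writes are spread over 16 rows instead of one hot row

table = TableClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"], table_name="Counters")

def add(counter_name: str, delta: int) -> None:
    """Apply the delta to one randomly chosen shard row; the ETag retry is the
    same as before, but conflicts are now roughly NUM_SHARDS times less likely."""
    row_key = f"shard-{random.randrange(NUM_SHARDS)}"
    while True:
        entity = table.get_entity(counter_name, row_key)
        entity["value"] = entity.get("value", 0) + delta
        try:
            table.update_entity(entity, mode=UpdateMode.MERGE,
                                etag=entity.metadata["etag"],
                                match_condition=MatchConditions.IfNotModified)
            return
        except HttpResponseError as err:
            if err.status_code != 412:  # anything other than a stale ETag is a real error
                raise

def total(counter_name: str) -> int:
    """Read the counter by summing every shard in its partition."""
    shards = table.query_entities(f"PartitionKey eq '{counter_name}'")
    return sum(e.get("value", 0) for e in shards)
```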
Also, another way to ease the pain is to go for an asynchronous architecture à la CQRS, where the performance constraints you put on the update latency of entities are significantly relaxed.
I believe the approach needs re-architecting. In order to ensure scalability and limit the amount of contention, you want to make sure that every write can work optimistically by providing a unique Table/PartitionKey/RowKey combination.
If you need those values merged together for reports, have a separate process/worker that post-aggregates/merges the records for reporting purposes, as sketched below. You can use a queue or a timer to kick off the aggregation/merging.
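A sketch of that split (Python, azure-data-tables SDK; table and property names are placeholders):

```python
import os
import uuid

from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"], table_name="ReportEvents")

def record_increment(report_id: str, field: str, delta: int) -> None:
    """Every write gets its own RowKey, so worker instances never contend."""
    table.create_entity({
        "PartitionKey": report_id,
        "RowKey": str(uuid.uuid4()),
        "Field": field,
        "Delta": delta,
    })

def aggregate(report_id: str) -> dict:
    """Run periodically (queue- or timer-triggered worker) to merge the raw
    increments into report totals."""
    totals = {}
    for event in table.query_entities(f"PartitionKey eq '{report_id}'"):
        totals[event["Field"]] = totals.get(event["Field"], 0) + event["Delta"]
    return totals
```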