Scale CosmosDB binding for Azure Functions per logical partition

I would like my Azure Function to scale per logical partition instead of per physical partition. I've tested the Azure Functions binding, and it does scale out when I have multiple physical partitions (in my test I needed to increase our RUs from 2000 to 20000). But I don't need that many RUs, since I'm using the container as an event store: I'm not querying the data, just processing each message through my Azure Function. So I'm wondering if there is a way to let Azure Functions scale out per logical partition. I see that in the new v3 lib there is a ChangeFeedOptions.PartitionKey property, but that class is internal and I'm not sure it does what I want.
I basically want to have as many Azure Functions running as there are new messages grouped per logical partition. What would be the best way to achieve that?

As of today this is not possible. Lease management is not up to the user of the Change Feed (CF) SDK. The SDK does it for us, and there is nothing we can do to change that.
The only way to get, even in theory, one lease per logical partition is to have a logical partition big enough to occupy a whole physical partition. That, however, means you are about to hit 10 GB of data in a single partition, which would be your main concern at that point.
I wouldn't worry about the scaling, though. The CF will create as many leases as it needs to scale seamlessly, and this scaling depends solely on the volume of data in the database and the number of RUs allocated.
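
If the goal is only to parallelize work per logical partition, one possible workaround is to fan out inside the function itself: the batch that one invocation receives can contain documents from several logical partitions, so they can be grouped by partition key and each group processed separately. A minimal Python sketch, assuming the documents arrive as dicts and that the partition key field is called streamId (a hypothetical name); process_stream is a placeholder for your own handler:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_partition_key(documents, key_field="streamId"):
    """Group change feed documents by logical partition key, keeping batch order."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[key_field]].append(doc)
    return dict(groups)


def process_stream(partition_key, events):
    """Hypothetical handler: process one logical partition's events in order."""
    return partition_key, len(events)


def handle_batch(documents, key_field="streamId", max_workers=8):
    """Process each logical partition's events on its own worker thread."""
    groups = group_by_partition_key(documents, key_field)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(process_stream, key, events)
            for key, events in groups.items()
        ]
        # Each result is a (partition_key, processed_count) pair.
        return dict(future.result() for future in futures)
```

This keeps the events of each group in the order they appeared in the batch, but the parallelism stays inside one function instance; it does not change how leases are distributed.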

Related

Designing a timer-triggered processor which relies on data from events

I am trying to design a timer-triggered processor (all in Azure) that will process a set of records laid out for it to consume. It will group them by a column, create files from the groups, and dump the files in a blob container. The records it consumes are generated in response to an event: when the event is raised, it contains a key that can be used to query the data for the record (the data for each record is pulled from different services).
This is what I am thinking currently:
1. An event is raised to an Event Grid topic.
2. An Azure Function (ConsumerApp) is event triggered, reads the key, calls a service API to get all the data, and stores that record in a storage table with a flag marking it ready to be consumed.
3. An Azure Function (ProcessorApp) is timer triggered, reads from the storage table, groups by another column, and creates and dumps the files. It can then mark the records as processed, if they were not already updated by ConsumerApp.
Some of my questions on this, apart from whether there is a different, better way to do it:
1. The table storage is going to fill up quickly, which will slow down reading the 'ready' cases, so is there a better approach to storing this intermediate, temporary data? One thing I thought of was to regularly flush the table, or to delete the record from the consumer app instead of marking it as 'processed'.
2. The service API is called for each event, which might increase the strain on that service and its database. Should I group the calls for records into a single API call, since the processor will run only after a set interval, or is there a better approach here?
Any feedback on this approach, or a new design, will be appreciated.

If you don't have to process the data in step 2 individually, you can save it in a blob too and add a record with the blob path to Azure Table Storage, which keeps the row count minimal.
Azure Table Storage has partitions that you can use to partition your data and keep your read operations fast; a partition scan is faster than a table scan. In addition, Azure Table Storage is cheap, but if pricing is a concern you can write a clean-up function that periodically removes the processed rows. Keeping the processed rows around for a reasonable time is usually a good idea, because you may need them for debugging.
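
As a rough illustration of the partitioning point, the entities could use the arrival hour as the PartitionKey, so that the processor only scans the partitions for the interval it is working on. This is a sketch under assumptions: the hourly bucket, the Status and Payload property names, and both helper functions are made up for the example, and no storage SDK is called here.

```python
from datetime import datetime, timezone


def hour_bucket(moment):
    """Format a timestamp as an hourly bucket such as '2024010203'."""
    return moment.strftime("%Y%m%d%H")


def make_entity(record_key, payload, received_at):
    """Build a table entity whose PartitionKey is the hour the record arrived."""
    return {
        "PartitionKey": hour_bucket(received_at),
        "RowKey": record_key,
        "Status": "ready",  # hypothetical flag property
        "Payload": payload,
    }


def partition_filter(hour):
    """OData filter that limits a query to one partition's ready rows."""
    return f"PartitionKey eq '{hour_bucket(hour)}' and Status eq 'ready'"
```

The same bucket value also gives the clean-up function an easy way to find old partitions to delete.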
By batching multiple calls into a single call, you can decrease the network I/O delay, but resource contention will remain at the service level. You can try moving that API to a separate service, if possible, so that it can scale separately.
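
A minimal sketch of the batching idea, assuming the service could expose some bulk endpoint. Here fetch_many is a hypothetical callable that takes a list of keys and returns their records; the batch size of 50 is arbitrary.

```python
def chunked(keys, size):
    """Yield successive lists of at most `size` keys."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def fetch_records(keys, fetch_many, batch_size=50):
    """Call the (hypothetical) bulk endpoint once per chunk and merge the results."""
    records = []
    for chunk in chunked(list(keys), batch_size):
        records.extend(fetch_many(chunk))
    return records
```

With this shape the processor makes one call per chunk instead of one per event, which reduces round trips but not the total work the service has to do.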

Azure Cosmos DB: How to create read replicas for a specific container

In Azure Cosmos DB, is it possible to create multiple read replicas at a database / container / partition key level to increase read throughput? I have several containers that will need more than 10K RU/s per logical partition key, and re-designing my partition key logic is not an option right now. Thus, I'm thinking of replicating data (eventual consistency is fine) several times.
I know Azure offers global distribution with Cosmos DB, but what I'm looking for is replication within the same region and ideally not a full database replication but a container replication. A container-level replication will be more cost effective since I don't need to replicate most containers and I need to replicate the others up to 10 times.

A few options are available, though:
There is no replication option within the same region, but you could use the Change Feed to replicate to another DB, built with the redesign in mind and used only for read queries. It might be a better idea, though, to use either the serverless option (which is in preview) or the autoscale option. You can also look at provisioned throughput and reserve the provisioned RUs for 1 or 3 years, paying monthly just as you would in the PAYG model but with a large discount. Another option is to run a latency test from the VM in the region where your main DB and app are running, find the closest region in terms of latency (ms), and, if that latency is bearable, use global replication to that region. I use this tool for latency tests, but run it from a VM within the region where your app/DB is running.

My guess is your queries are all cross-partition and each query is consuming a ton of RU/s. Unfortunately there is no feature in Cosmos DB to help in the way you're asking.
Your options are to create more containers and use change feed to replicate data to them in-region and then add some sort of routing mechanism to route requests in your app. You will of course only get eventual consistency with this so if you have high concurrency needs this won't work. Your only real option then is to address the scalability issues with your design today.
Something that can help is this live data migrator. You can use it to keep a second container in sync with your original, which will allow you to eventually migrate off your first design to one that scales better.
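
For the routing mechanism mentioned above, a very small sketch could look like the following. It is only an illustration: the ReplicaRouter class and the container names are hypothetical, it only picks a container name and does not talk to Cosmos DB, and reads served this way are eventually consistent because the replicas are filled from the change feed.

```python
import itertools
import threading


class ReplicaRouter:
    """Round-robin reads over container names; writes always go to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._lock = threading.Lock()
        self._cycle = itertools.cycle([primary, *replicas])

    def for_write(self):
        """Writes must hit the primary so the change feed can fan them out."""
        return self.primary

    def for_read(self):
        """Return the next container name in round-robin order."""
        with self._lock:
            return next(self._cycle)
```

The app would call for_read() before each query and pass the returned name to whatever client it uses to open the container.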

How to ensure only one Azure Function BlobTrigger runs at a time?

I have a use case to implement multiple BlobTriggers in Azure Functions (using the Linux Consumption Plan). For example in Azure Storage I would have 5 different clients with a directory structure like:
client1/file.txt
client2/file.txt
client3/file.txt
client4/file.txt
client5/file.txt
It's possible for both client1/file.txt and client2/file.txt to be dropped off at the same time in Azure Storage. To prevent race conditions and exceeding the 1.5 GB memory limit, I would like the BlobTrigger for client1/file.txt to wait for the BlobTrigger for client2/file.txt to finish or vice versa (the order doesn't matter here, just that both of them eventually execute).
Do I have to set up a queue process separately? Can I use the preview setting WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT to achieve this easily?
Edit: Would using durable functions be a better solution?

You should be able to do this by making sure the max scale-out value is set to 1; this way it will only process one file at a time. You can also change your pricing model from the Consumption plan to an App Service plan. That way you can pick the tier you want and have more memory available as well (depending on the tier you choose).

What to do if Azure SQL Managed Instance reaches the max storage limit?

Azure SQL Managed Instance can reach its storage limit if the total size of the databases (both user and system) reaches the instance limit. In this case the following issues might happen:
- Any operation that updates data or rebuilds structures might fail because it cannot be written to the log.
- Some read-only queries might fail if they require tempdb, which cannot grow.
- Automated backups might not be taken, because the database must perform a checkpoint to flush the dirty pages to the data files, and this action fails because there is no space.
How can this problem be resolved if the managed instance reaches the storage limit?

There are several ways to resolve this issue:
- Increase the instance storage limit using the portal, PowerShell, or the Azure CLI.
- Decrease the size of the databases by using DBCC SHRINKDATABASE, or by dropping unnecessary data/tables (for example #temporary tables in tempdb).
The preferred way is to increase the storage, because even if you free some space, the next maintenance operation might fill it again.

Is it better to have many small Azure storage blob containers (each with some blobs) or one really large container with tons of blobs?

So the scenario is the following:
I have multiple instances of a web service that write blobs of data to Azure Storage. I need to be able to group blobs into a container (or a virtual directory) depending on when they were received. Once in a while (every day at worst) older blobs will get processed and then deleted.
I have two options:
Option 1
I make one container called "blobs" (for example) and then store all the blobs in that container. Each blob will use a directory-style name, with the directory name being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ... , "hr23min0/dataN.bin", etc.; a new directory every X minutes). The thing that processes these blobs will process hr0min0 blobs first, then hr0minX, and so on (and blobs are still being written while others are being processed).
Option 2
I have many containers, each with a name based on the arrival time (so first there will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and all the blobs in a container are those that arrived at the named time. The thing that processes these blobs will process one container at a time.
So my question is, which option is better? Does option 2 give me better parallelization (since containers can be on different servers), or is option 1 better because many containers can cause other unknown issues?
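
For reference, the Option 1 naming scheme can be sketched in a few lines of Python. The function names and the 15-minute default are made up for the example; the interval stands in for the "X minutes" above.

```python
from datetime import datetime


def bucket_prefix(received_at, interval_minutes=15):
    """Return the virtual-directory prefix, e.g. 'hr0min30', for an arrival time."""
    if not 0 < interval_minutes <= 60:
        raise ValueError("interval_minutes must be between 1 and 60")
    # Round the minute down to the start of its interval.
    minute = (received_at.minute // interval_minutes) * interval_minutes
    return f"hr{received_at.hour}min{minute}"


def blob_name(received_at, file_name, interval_minutes=15):
    """Build the full blob name inside the single 'blobs' container."""
    return f"{bucket_prefix(received_at, interval_minutes)}/{file_name}"
```

For Option 2 the same prefix could be used as the container name suffix, although container names only allow lowercase letters, digits, and hyphens, so a name like blobs_hr0min0 would need a hyphen instead of the underscore.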

Everyone has given you excellent answers around accessing blobs directly. However, if you need to list blobs in a container, you will likely see better performance with the many-container model. I just talked with a company that has been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs. They're seeing a performance hit, as the time to retrieve a full listing has been growing.
This might not apply to your scenario, but it's something to consider...

I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure Blob storage is done at the blob level, not the container level. Reasons to spread out across different containers have more to do with access control (e.g. SAS) or total storage size.
See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
(Scroll down to "Partitions").
Quoting:
Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need them to (within the storage account space limit). The tradeoff is that we don’t provide the ability to do atomic transactions across multiple blobs.

Theoretically speaking, there should be no difference between lots of containers or fewer containers with more blobs. The extra containers can be nice as additional security boundaries (for public anonymous access or different SAS signatures for instance). Extra containers can also make housekeeping a bit easier when pruning (deleting a single container versus targeting each blob). I tend to use more containers for these reasons (not for performance).
Theoretically, the performance impact should not exist. The blob itself (full URL) is the partition key in Windows Azure (has been for a long time). That is the smallest thing that will be load-balanced from a partition server. So, you could (and often will) have two different blobs in same container being served out by different servers.
Jeremy indicates there is a performance difference between more and fewer containers. I have not dug into those benchmarks enough to explain why that might be the case, but I would suspect other factors (like size, duration of test, etc.) to explain any discrepancies.

There is also one more factor that comes into this: price!
Currently the List and Create Container operations cost the same:
US$0.054 per 10,000 calls
Writing a blob actually costs the same.
So in an extreme case you can pay a lot more if you create and delete many containers (although delete itself is free).
You can see the calculator here:
https://azure.microsoft.com/en-us/pricing/calculator/
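
To get a feel for the numbers, here is a small back-of-the-envelope calculation in Python. It uses the price quoted above, which may well be out of date, and the function name and the 30-day month are assumptions for the example.

```python
# US$ per 10,000 calls, the figure quoted above; check current pricing.
PRICE_PER_10K_CALLS = 0.054


def container_creation_cost(containers_per_day, days=30):
    """Estimated cost in US$ of the Create Container calls alone."""
    calls = containers_per_day * days
    return calls / 10_000 * PRICE_PER_10K_CALLS
```

One new container every 15 minutes is 96 a day, and container_creation_cost(96) comes to roughly 1.6 US cents a month, so at that rate the difference only starts to matter with far more containers or with frequent List calls on top.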

https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning
Understanding how Azure Storage partitions your blob data is useful for enhancing performance. Azure Storage can serve data in a single partition more quickly than data that spans multiple partitions. By naming your blobs appropriately, you can improve the efficiency of read requests.
Blob storage uses a range-based partitioning scheme for scaling and load balancing. Each blob has a partition key comprised of the full blob name (account+container+blob). The partition key is used to partition blob data into ranges. The ranges are then load-balanced across Blob storage.
