I am building a function app that is triggered by a queue message, reads some input files from Blob Storage, combines them, and writes a new file to Blob Storage.
Each time the function runs, I see a very high number of file transactions, resulting in unexpected costs. The costs are related to "File Write/Read/Protocol Operation Units".
The function has a queue trigger binding, three input bindings pointing to Blob Storage, and an output binding pointing to Blob Storage.
The Function App is running on Python (which I know is experimental).
When looking at the metrics of my storage account, I see spikes of up to 50k file transactions each time I run my function. Testing with an empty function triggered by a queue message, I still get 5k file transactions.
Normally the function writes its output to the output binding location (which for a Python function is a temporary file on the function app's storage, which I presume is then copied back to Blob Storage).
In this related question (Expensive use of storage account from Azure Functions), the high storage costs are suspected to be related to logging. In my case, logging is not enabled in the host.json file, and I've also disabled logging on the storage account. This hasn't resolved the issue.
Are these values normal for an output file of 60KB and an input file of around 2MB?
Is this related to the Python implementation or is this to be expected for all languages?
Can I avoid this?
The Python implementation in V1 Functions creates inefficiencies that can lead to significant file usage. This is a known shortcoming. Work is in progress on a Python implementation for Functions V2 that will not have this problem.
Related
I have a simple Azure function which:
uses an EventHubTrigger as input
internally writes some events to Azure Storage
During some parts of the day the average batch size is around 250+, which is great because I write fewer blocks to Azure Storage, but most of the time the batch size is less than 10.
Is there any way to force the EventHubTrigger to wait until there are more than 50/100/200 messages to process, so I can reduce the number of append blocks in Azure Storage?
Update--
The host.json file contains settings that control Event Hub trigger behavior. See the host.json settings section for details regarding available settings.
You can specify a timestamp such that you receive events enqueued only after the given timestamp.
initialOffsetOptions/enqueuedTimeUtc: Specifies the enqueued time of the event in the stream from which to start processing. When initialOffsetOptions/type is configured as fromEnqueuedTime, this setting is mandatory. Supports time in any format supported by DateTime.Parse(), such as 2020-10-26T20:31Z. For clarity, you should also specify a timezone. When timezone isn't specified, Functions assumes the local timezone of the machine running the function app, which is UTC when running on Azure. For more information, see the EventProcessorOptions documentation.
---
If I understand your ask correctly, you want to hold all the received events until a certain threshold is reached and then process them at once in a single Azure Function run.
To receive events in a batch, make string or EventData an array in the binding configuration properties that you set in the function.json file and the EventHubTrigger attribute. Set the function.json property "cardinality" to "many" in order to enable batching. If omitted or set to "one", a single message is passed to the function. In C#, this property is automatically assigned whenever the trigger has an array for the type.
Note: When receiving in a batch, you cannot bind to method parameters with event metadata; you must read the metadata from each EventData object.
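For illustration, here is a minimal sketch of a batched receive for the in-process C# model, assuming the Microsoft.Azure.EventHubs-based extension; the hub name "my-hub" and the "EventHubConnection" app setting are placeholders. Binding to an EventData[] array is what gives you "many" cardinality:

```csharp
using System.Text;
using Microsoft.Azure.EventHubs;      // EventData type used by the 3.x/4.x Event Hubs extension
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BatchedEventHubFunction
{
    [FunctionName("BatchedEventHubFunction")]
    public static void Run(
        // Receiving EventData[] (an array) is what enables "many" cardinality in C#.
        [EventHubTrigger("my-hub", Connection = "EventHubConnection")] EventData[] events,
        ILogger log)
    {
        foreach (var e in events)
        {
            // Per-event metadata comes from each EventData object, not from method parameters.
            log.LogInformation("Enqueued {time}: {body}",
                e.SystemProperties.EnqueuedTimeUtc,
                Encoding.UTF8.GetString(e.Body.Array, e.Body.Offset, e.Body.Count));
        }
    }
}
```

Note that the batch you get is whatever the host currently has available (bounded by maxBatchSize in host.json), so this reduces the number of invocations but does not guarantee a minimum batch size.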
Further, I could not find an explicit way to do this. You would have to modify your approach, for example by using a timer trigger together with the Queue Storage API to read all the buffered messages, and then write them to a blob via the blob output binding.
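As a rough sketch of that second suggestion, and under the assumption that the incoming events are first forwarded to a Storage queue named "buffered-events" (the queue name, schedule, and blob path are all placeholders), a timer-triggered function could drain the queue and flush everything to one blob per run:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BufferedWriter
{
    [FunctionName("BufferedWriter")]
    public static async Task Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,                          // flush every 5 minutes
        [Blob("output/{DateTime}.txt", FileAccess.Write)] TextWriter outputBlob,  // one blob per flush
        ILogger log)
    {
        var queue = new QueueClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"), "buffered-events");

        int written = 0;
        while (true)
        {
            // Queue Storage returns at most 32 messages per call.
            var messages = (await queue.ReceiveMessagesAsync(maxMessages: 32)).Value;
            if (messages.Length == 0) break;

            foreach (var msg in messages)
            {
                await outputBlob.WriteLineAsync(msg.MessageText);
                await queue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
                written++;
            }
        }
        log.LogInformation("Flushed {count} buffered messages to a single blob.", written);
    }
}
```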
I am currently digging into the Azure Batch service, and I am confused about the proper use of a JobManagerTask...
...and maybe what the overall architecture of an Azure Batch application should look like. I have built the architecture below based on code samples from Microsoft found on GitHub.
These are my current application components.
App1 - ClusterHead
Creates a job (including an auto pool)
Defines the JobManagerTask
Runs on a workstation
App2 - JobManagerTask
Splits input data into chunks
Pushes chunks (unit of work) onto an input queue
Creates tasks (CloudTask)
App3 - WorkloadRunner
Pulls from the input queue
Executes the task
Pushes to the output queue
Azure Storage Account
Linked to Azure Batch account
Provides input & output queues
Provides a result table
Azure Durable Function
Implements the aggregator pattern by using DurableEntities so that I can access incoming results early, before the whole job has finished.
Gets triggered by messages in the output queue
Aggregates results and writes the entity to Azure Storage table
Questions
Is that proper use of the JobManagerTask?
Why do I want/need the extra binary/application package, that encapsulates the JobManagerTask?
Could someone please give an example of when I should prefer to use a JobManagerTask over creating the Jobs manually?
Thanks in advance!
Your example is one way a JobManagerTask can be used, although, as you mentioned, if the data is all generated by the JobManagerTask and is fixed, it could make sense to just merge that work into your ClusterHead. In your case it just depends on whether you want the split and upload of your data to occur as part of the job or to run on the workstation.
One area where JobManagerTasks excel is when the incoming data is continuous. Basically, if you had a bunch of writers pushing to a raw input queue, you could have your JobManagerTask run continuously, reading from that queue, splitting the data, and creating the tasks.
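To make that concrete, here is a rough sketch (not a complete implementation) of a JobManagerTask that keeps polling a raw input queue and turns each chunk into a CloudTask. The queue name, the BATCH_*/STORAGE_CONNECTION environment variables, and the WorkloadRunner command line are assumptions based on the architecture described above; AZ_BATCH_JOB_ID is set by the Batch service for every task:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class JobManager
{
    static async Task Main()
    {
        // Credentials and queue names are placeholders; a real JobManagerTask would receive
        // them via environment settings defined on the task.
        using var batchClient = BatchClient.Open(new BatchSharedKeyCredentials(
            Environment.GetEnvironmentVariable("BATCH_URL"),
            Environment.GetEnvironmentVariable("BATCH_ACCOUNT"),
            Environment.GetEnvironmentVariable("BATCH_KEY")));
        var inputQueue = new QueueClient(
            Environment.GetEnvironmentVariable("STORAGE_CONNECTION"), "raw-input");
        string jobId = Environment.GetEnvironmentVariable("AZ_BATCH_JOB_ID");

        int taskNumber = 0;
        while (true)
        {
            var messages = (await inputQueue.ReceiveMessagesAsync(maxMessages: 32)).Value;
            foreach (var msg in messages)
            {
                // One CloudTask per chunk; the WorkloadRunner command line is illustrative.
                var task = new CloudTask($"task-{taskNumber++}",
                    $"WorkloadRunner.exe --chunk \"{msg.MessageText}\"");
                await batchClient.JobOperations.AddTaskAsync(jobId, task);
                await inputQueue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
            }
            if (messages.Length == 0)
                await Task.Delay(TimeSpan.FromSeconds(15)); // idle until more input arrives
        }
    }
}
```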
I have an Azure Logic App which gets triggered when a new file is added or modified on an SFTP server. When that happens, the file is copied to Azure Blob Storage and then deleted from the SFTP server. This operation takes approximately 2 seconds per file.
The only problem I have is that these files (on average 500 KB) are processed one by one. Given that I'm looking to transfer around 30,000 files daily, this approach becomes very slow (around 18 hours).
Is there a way to scale out/parallelize these executions?
I am not sure that there is a built-in way to scale out/parallelize executions in Azure Logic Apps. But based on my experience, if the timeliness requirements are not very high, we could use a ForEach loop to do that; the ForEach parallelism limit is 50 and the default is 20.
In your case, my suggestion is to keep the trigger that fires when a new file is added or modified on the SFTP server, but have it only insert a queue message (with the file path as its content) into an Azure Storage queue, ending that loop based on time or queue length. We can then get the collection of queue messages and, finally, in a ForEach action, fetch each queue message and fetch the corresponding file from the SFTP server to create the blob.
If you're using C#, use Parallel.ForEach like Tom Sun said. If you go that route, I also recommend using the async/await pattern for the IO operation (saving to the blob). It frees up the executing thread to serve other requests while the file is being saved.
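A rough sketch of that idea, using Parallel.ForEachAsync (the async-aware counterpart available from .NET 6, since the classic Parallel.ForEach does not await async lambdas). The container name, connection-string variable, and local staging folder are placeholder assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class SftpToBlobCopier
{
    // Hypothetical helper: paths of files already pulled down from the SFTP server.
    static IEnumerable<string> GetStagedFiles() => Directory.EnumerateFiles("/tmp/sftp-staging");

    static async Task Main()
    {
        var container = new BlobContainerClient(
            Environment.GetEnvironmentVariable("STORAGE_CONNECTION"), "incoming-files");
        await container.CreateIfNotExistsAsync();

        // Upload up to 20 files concurrently; each upload is awaited, so the executing
        // thread is released while the blob write is in flight.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 20 };
        await Parallel.ForEachAsync(GetStagedFiles(), options, async (path, ct) =>
        {
            await using var stream = File.OpenRead(path);
            await container.GetBlobClient(Path.GetFileName(path))
                           .UploadAsync(stream, overwrite: true, cancellationToken: ct);
            File.Delete(path); // mirror the "delete after copy" behaviour from the question
        });
    }
}
```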
How can I persist a small amount of data between Azure Function executions? Like in a global variable? The function runs on a Timer Trigger.
I need to store the result of one Azure Function execution and use this as input of the next execution of the same function. What is the cheapest (not necessarily simplest) way of storing data between function executions?
(Currently I'm using the free amount of Azure Functions that everyone gets and now I'd like to save state in a similar free or cheap way.)
There are a couple of options - I'd recommend that you store your state in a blob.
You could use a blob input binding to read global state for every execution, and a blob output binding to update that state.
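A minimal sketch of that option for a C# in-process function, assuming a placeholder "state/last-run.txt" blob and a 30-minute schedule; the input binding hands back null on the very first run, and the output binding overwrites the blob for the next run:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class StatefulTimer
{
    [FunctionName("StatefulTimer")]
    public static async Task Run(
        [TimerTrigger("0 */30 * * * *")] TimerInfo timer,
        [Blob("state/last-run.txt", FileAccess.Read)] string previousState,   // null on the first run
        [Blob("state/last-run.txt", FileAccess.Write)] TextWriter newState,
        ILogger log)
    {
        log.LogInformation("State from the previous execution: {state}", previousState ?? "<none>");

        // ...do the real work here, then persist whatever the next execution needs...
        await newState.WriteAsync(DateTime.UtcNow.ToString("o"));
    }
}
```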
You could also remove the timer trigger and use queues, with the state stored in the queue message and a visibility timeout on the message to set the schedule (i.e. the next execution time).
Finally, you could use a file on the file system, as it is shared across the function app.
If you can accept the possibility of data loss and only care at the instance level, you can:
maintain a static data structure
write to instance local storage
Durable entities are now available to handle persistence of state.
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-entities?tabs=csharp
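For reference, a minimal class-based entity along the lines of the documentation's counter sample (the names here are illustrative). Its state is persisted by the Durable Functions runtime between executions, and a timer-triggered function can update or read it through an IDurableEntityClient obtained from a [DurableClient] binding:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

// A small class-based entity: its state is persisted by the Durable Functions
// runtime between executions, independent of any single function instance.
[JsonObject(MemberSerialization.OptIn)]
public class Counter
{
    [JsonProperty("value")]
    public int Value { get; set; }

    public void Add(int amount) => Value += amount;

    public Task<int> Get() => Task.FromResult(Value);

    [FunctionName(nameof(Counter))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<Counter>();
}
```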
This is an old thread, but it's worth sharing the new way to handle state in Azure Functions.
We now have the Durable Functions approach from Microsoft itself, with which we can maintain function state easily and effectively. Please refer to the documentation from Microsoft below.
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview
I am assessing Azure Queue Storage to communicate between two decoupled applications.
My requirement is to send a file (flat file, size: small to large) in the queue message.
As per my reading, an individual message in a queue cannot exceed 64 KB, so sending a file of variable size in the message is out of the question.
Another solution I can think of is using a combination of Queue Storage and Blob Storage, i.e. adding a reference to the file (in Blob Storage) to the queue message and then, when required, reading the file from the blob (using the reference/address in the queue message).
My question is: is this the right approach, or are there other, more elegant ways of achieving this?
Thanks,
Sandeep
While there's no right approach, since you can put anything you want in a queue message (within size limits), consider this: If your file sizes can go over 64K, you simply cannot store these within a queue message, so you will have no other choice but to store your content somewhere else (e.g. blobs). For files under 64K, you'll need to decide whether you want two different methods for dealing with files, or just use blobs as your file source across the board and have a consistent approach.
Also remember that message-passing will eat up bandwidth and processing. If you store your files in queue messages, you'll need to account for this with high-volume message-passing, and you'll also need to extract your file content from your queue messages.
One more thing: If you store content in blobs, you can use any number of tools to manipulate these files, and your files remain in blob storage permanently (until you explicitly delete them). Queue messages must be deleted after processing, giving you no option to keep your file around. This is probably an important aspect to consider.
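To illustrate the queue-plus-blob ("claim check") approach discussed above, here is a rough sketch with both sides in one program; the container, queue, and file names are placeholders, and real code would handle empty receives and poison messages:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Queues;

class ClaimCheckSample
{
    static async Task Main()
    {
        string connection = Environment.GetEnvironmentVariable("STORAGE_CONNECTION");
        var container = new BlobContainerClient(connection, "transfer-files");
        var queue = new QueueClient(connection, "transfer-requests");
        await container.CreateIfNotExistsAsync();
        await queue.CreateIfNotExistsAsync();

        // Producer: upload the file, then enqueue only its blob name (the "claim check").
        string blobName = $"{Guid.NewGuid()}.dat";
        await using (var input = File.OpenRead("input.dat"))
        {
            await container.GetBlobClient(blobName).UploadAsync(input, overwrite: true);
        }
        await queue.SendMessageAsync(blobName);

        // Consumer: read the small reference from the queue, then fetch the file from blobs.
        var msg = (await queue.ReceiveMessagesAsync(maxMessages: 1)).Value[0];
        var download = await container.GetBlobClient(msg.MessageText).DownloadContentAsync();
        byte[] fileBytes = download.Value.Content.ToArray();

        // The queue message must be deleted after processing; the blob can stay as long as needed.
        await queue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
        Console.WriteLine($"Downloaded {fileBytes.Length} bytes referenced by message {msg.MessageId}.");
    }
}
```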