Azure BlobTrigger is firing > 5 times for each new blob

The following trigger removes EXIF data from blobs (which are images) after they are uploaded to Azure Storage. The problem is that the blob trigger fires at least 5 times for each blob.
In the trigger the blob is updated by writing a new stream of data to it. I had assumed that blob receipts would prevent the blob trigger from firing again for this blob.
[FunctionName("ExifDataPurge")]
public async System.Threading.Tasks.Task RunAsync(
[BlobTrigger("container/{name}.{extension}", Connection = "")]CloudBlockBlob image,
string name,
string extension,
string blobTrigger,
ILogger log)
{
log.LogInformation($"C# Blob trigger function Processed blob\n Name:{name}");
try
{
var memoryStream = new MemoryStream();
await image.DownloadToStreamAsync(memoryStream);
memoryStream.Position = 0;
using (Image largeImage = Image.Load(memoryStream))
{
if (largeImage.Metadata.ExifProfile != null)
{
//strip the exif data from the image.
for (int i = 0; i < largeImage.Metadata.ExifProfile.Values.Count; i++)
{
largeImage.Metadata.ExifProfile.RemoveValue(largeImage.Metadata.ExifProfile.Values[i].Tag);
}
var exifStrippedImage = new MemoryStream();
largeImage.Save(exifStrippedImage, new SixLabors.ImageSharp.Formats.Jpeg.JpegEncoder());
exifStrippedImage.Position = 0;
await image.UploadFromStreamAsync(exifStrippedImage);
}
}
}
catch (UnknownImageFormatException unknownImageFormatException)
{
log.LogInformation($"Blob is not a valid Image : {name}.{extension}");
}
}

Triggers are handled in such a way that the runtime tracks which blobs have been processed by storing blob receipts in the container azure-webjobs-hosts. Any blob without a receipt, or with an old receipt (based on the blob's ETag), will be processed (or reprocessed).
Since you are calling await image.UploadFromStreamAsync(exifStrippedImage);, the blob's ETag changes, so the stored receipt no longer matches and the trigger fires again.

When you call await image.UploadFromStreamAsync(exifStrippedImage); it updates the blob, so the blob function triggers again.
To break the loop, you can check an existing property on the blob (CacheControl, for example) and skip the update if it has already been set:
// Set the CacheControl property to expire in 1 hour (3600 seconds)
blob.Properties.CacheControl = "max-age=3600";
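A minimal sketch of how that guard could look, assuming the same CloudBlockBlob binding as in the question (the specific CacheControl value is just a marker, and setting Properties before the upload is expected to persist them with the new content):
// Hypothetical guard: skip blobs that have already been re-uploaded.
await image.FetchAttributesAsync();              // make sure Properties is populated
if (image.Properties.CacheControl == "max-age=3600")
{
    log.LogInformation($"blob: {name} has already been processed");
    return;
}

// ... strip the EXIF data as in the question ...

image.Properties.CacheControl = "max-age=3600";  // mark the blob before re-uploading it
await image.UploadFromStreamAsync(exifStrippedImage);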

So I've addressed this by storing a Status value in the blob's metadata as it is processed.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-container-properties-metadata
The trigger then contains a guard that checks for the metadata.
if (image.Metadata.ContainsKey("Status") && image.Metadata["Status"] == "Processed")
{
    // any subsequent processing for the blob will enter this block.
    log.LogInformation($"blob: {name} has already been processed");
}
else
{
    // first time the trigger fires for this blob
    image.Metadata.Add("Status", "Processed");
    await image.SetMetadataAsync();
}
The other answers pointed me in the right direction. I think it is more correct to use the metadata: storing an ETag elsewhere seems redundant when we can store metadata, and the use of CacheControl seems like too much of a hack; other developers might be confused about what I have done and why.
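For context, a minimal sketch of how the guard can sit around the processing in the trigger (the early return and the exact point where the metadata is written are my assumptions, not part of the answer above):
// Hypothetical layout of the guarded trigger body.
if (image.Metadata.ContainsKey("Status") && image.Metadata["Status"] == "Processed")
{
    log.LogInformation($"blob: {name} has already been processed");
    return;                                    // stop before touching the blob again
}

// ... download the blob, strip the EXIF data and upload the new stream here ...

image.Metadata.Add("Status", "Processed");     // mark the blob so later firings exit early
await image.SetMetadataAsync();
Note that SetMetadataAsync changes the blob's ETag, so the trigger typically fires one more time; that extra invocation hits the guard and exits immediately.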

Related

storage account - export blob size and date [duplicate]

Azure Blob Change Feed missing blob append events

I'm trying out the Azure Blob Change Feed feature and it behaves strangely with Append Blobs: append events are missing from the feed.
My scenario is:
Create a storage account and enable the change feed feature.
Create the Append Blob if it does not exist (1) and append some input to it (2).
private void WriteBlob(string input)
{
    MemoryStream stream = new MemoryStream(Encoding.UTF8.GetBytes(input));
    try
    {
        if (client == null)
        {
            var credential = new ClientSecretCredential("...", "...");
            client = new AppendBlobClient(new Uri("..."), credential);
        }
        client.CreateIfNotExists(); // (1)
        client.AppendBlock(stream); // (2)
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
}
Fetch the Change Feed entries in a separate console app.
public static List<BlobChangeFeedEvent> GetChanges()
{
    var credential = new ClientSecretCredential("...", "...");
    BlobChangeFeedClient blobChangeFeedClient = new BlobChangeFeedClient(new Uri("..."), credential);

    List<BlobChangeFeedEvent> events = new List<BlobChangeFeedEvent>();
    foreach (BlobChangeFeedEvent changeFeedEvent in blobChangeFeedClient.GetChanges())
    {
        events.Add(changeFeedEvent);
    }
    return events;
}
The problem is that after a few runs of the WriteBlob method I only get a single change feed event, corresponding to the blob creation; the subsequent appends are missing from the feed, even though the input is appended to the blob successfully.
Why does it work this way? I didn't find anything special about the Append Blob type in the Change Feed docs.
Currently, the append event for an append blob is not supported.
As per this doc, only the following event types are supported:
BlobCreated
BlobDeleted
BlobPropertiesUpdated
BlobSnapshotCreated
And in the source code of the Azure.Storage.Blobs.ChangeFeed package, there is no append event type.
A feature request for this has been submitted; hopefully it can be added in a future release.
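If you want to confirm what the feed contains, a small diagnostic sketch along these lines (reusing the BlobChangeFeedClient from the question) prints the type of each returned event, which should show only the BlobCreated entry for the append blob:
// Hypothetical diagnostic: list the event types returned by the change feed.
foreach (BlobChangeFeedEvent changeFeedEvent in blobChangeFeedClient.GetChanges())
{
    Console.WriteLine($"{changeFeedEvent.EventTime:u} {changeFeedEvent.EventType} {changeFeedEvent.Subject}");
}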

Delete files older than X number of days from Azure Blob Storage using Azure function

I want to create an Azure Function that deletes files from Azure Blob Storage when their last modified date is older than 30 days.
Can anyone help, or point me to documentation for doing that?
Assuming your storage account's type is either General Purpose v2 (GPv2) or Blob Storage, you actually don't have to do anything by yourself. Azure Storage can do this for you.
You'll use Blob Lifecycle Management and define a policy there to delete blobs once they are older than 30 days; Azure Storage will take care of the deletion for you.
You can learn more about it here: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-lifecycle-management-concepts.
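For illustration, a lifecycle policy along these lines deletes base blobs 30 days after their last modification (the rule name is arbitrary and the blockBlob filter is an assumption; add a prefixMatch if you only want a specific container):
{
  "rules": [
    {
      "name": "delete-after-30-days",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": [ "blockBlob" ]
        },
        "actions": {
          "baseBlob": {
            "delete": { "daysAfterModificationGreaterThan": 30 }
          }
        }
      }
    }
  ]
}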
You can create a Timer Trigger function, fetch the list of items from the blob container and delete the blobs whose last modified date falls outside your criteria (a minimal timer skeleton is sketched after this list):
Create a Timer Trigger function.
Fetch the list of blobs using CloudBlobContainer.
Cast the blob items to the proper type and check the LastModified property.
Delete the blobs that don't match the criteria.
I hope that answers the question.
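A minimal skeleton for such a timer, as a sketch only (the schedule, function name, container name and connection setting are assumptions; the body would hold the listing and deletion logic shown in the next answer):
// Hypothetical timer-triggered function that runs every day at 01:00.
[FunctionName("CleanupOldBlobs")]
public static async Task Run(
    [TimerTrigger("0 0 1 * * *")] TimerInfo timer,
    [Blob("sandbox", Connection = "StorageConnectionString")] CloudBlobContainer container,
    ILogger log)
{
    log.LogInformation($"Cleanup started at {DateTime.UtcNow:u}");
    // ... list the blobs, check LastModified and delete them, as in the HTTP-triggered example below ...
    await Task.CompletedTask;   // placeholder so the sketch compiles without the real work
}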
I have used HTTP as the trigger, since you didn't specify one and it's easier to test, but the logic would be the same for a Timer trigger etc. I also assumed C#:
[FunctionName("HttpTriggeredFunction")]
public static async Task<IActionResult> Run(
[HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
[Blob("sandbox", Connection = "StorageConnectionString")] CloudBlobContainer container,
ILogger log)
{
log.LogInformation("C# HTTP trigger function processed a request.");
// Get a list of all blobs in your container
BlobResultSegment result = await container.ListBlobsSegmentedAsync(null);
// Iterate each blob
foreach (IListBlobItem item in result.Results)
{
// cast item to CloudBlockBlob to enable access to .Properties
CloudBlockBlob blob = (CloudBlockBlob)item;
// Calculate when LastModified is compared to today
TimeSpan? diff = DateTime.Today - blob.Properties.LastModified;
if (diff?.Days > 30)
{
// Delete as necessary
await blob.DeleteAsync();
}
}
return new OkObjectResult(null);
}
Edit - How to download JSON file and deserialize to object using Newtonsoft.Json:
public class MyClass
{
    public string Name { get; set; }
}

var json = await blob.DownloadTextAsync();
var myClass = JsonConvert.DeserializeObject<MyClass>(json);

Updating many json files in Azure blob storage?

I have roughly 1,000,000 JSON files that I would like to update every 30 minutes. The update simply appends a new array to the end of the existing content.
A single update uses code similar to:
CloudBlockBlob blockBlob = container.GetBlockBlobReference(blobName);
JObject jObject = null;

// If the blob exists, then we may need to update it.
if (blockBlob.Exists())
{
    MemoryStream memoryStream = new MemoryStream();
    blockBlob.DownloadToStream(memoryStream);
    jObject = JsonConvert.DeserializeObject(Encoding.UTF8.GetString(memoryStream.ToArray())) as JObject;
} // End of: the blob exists

if (null == jObject)
{
    jObject = new JObject();
    jObject.Add(new JProperty("identifier", identifier));
} // End of: the blob did not exist

JArray jsonArray = new JArray();
jObject.Add(new JProperty(string.Format("entries{0}", timestamp.ToString()), jsonArray));

foreach (var entry in newEntries)
{
    jsonArray.Add(new JObject(
        new JProperty("someId", entry.id),
        new JProperty("someValue", value)
    ));
} // End of loop

string jsonString = JsonConvert.SerializeObject(jObject);

// Upload
blockBlob.Properties.ContentType = "text/json";
blockBlob.UploadFromStream(new MemoryStream(Encoding.UTF8.GetBytes(jsonString)));
Basically:
Check if the blob exists.
If it does, download the data and create a JSON object from the existing details.
If it does not, create a new object with the details.
Push the update to the blob.
The problem with this is performance. I've done quite a few things to increase performance (the updates run in five parallel threads, and I have set ServicePointManager.UseNagleAlgorithm to false).
It still runs slowly though. Roughly ~100,000 updates can take up to an hour.
So I guess, basically, my questions are:
Should I be using Azure Blob storage for this? (I'm open to alternative suggestions).
If so, any suggestions on improving performance?
Note: The file basically contains a history of events and I cannot re-generate the entire file based on existing data. This is why the contents are downloaded before being updated.

Blob does not exist when using BeginUploadFromStream

After I use the CloudBlob.BeginUploadFromStream() method to upload a file, I later get a StorageClientException with StorageErrorCode.ResourceNotFound when trying to retrieve the file for download. If I upload the same file using the CloudBlob.UploadFromStream() method, then the blob DOES exist and I can download it.
Here's my download code:
var client = _storageAccount.CreateCloudBlobClient();
var container = client.GetContainerReference(BLOB_CONTAINER_DOCUMENTS_ADDRESS);
container.CreateIfNotExist();

string blobName = id.ToString();
var newBlob = container.GetBlobReference(blobName);

if (newBlob.Exists())
{
    var stream = newBlob.OpenRead();
    return stream;
}
else
{
    throw new Exception("Blob does not exist!");
}
Exists is an extension method; I'm getting the StorageClientException with the error code ResourceNotFound when I use the BeginUploadFromStream() method.
public static bool Exists(this CloudBlob blob)
{
    try
    {
        blob.FetchAttributes();
        return true;
    }
    catch (StorageClientException e)
    {
        if (e.ErrorCode == StorageErrorCode.ResourceNotFound)
        {
            return false;
        }
        else
        {
            throw;
        }
    }
}
And my call to upload:
var blob = container.GetBlobReference(blobName);
This will NOT throw an exception when I later check if the blob exists:
blob.UploadFromStream(fileStream);
This will:
AsyncCallback uploadCompleted = new AsyncCallback(OnUploadCompleted);
blob.BeginUploadFromStream(fileStream, uploadCompleted, documentId);
EDIT
As suggested, I didn't have a call to the EndUploadFromStream() method. Here is my updated call to upload:
blob.BeginUploadFromStream(fileStream, uploadCompleted, blob);
And my handler:
private void OnUploadCompleted(IAsyncResult result)
{
    var blob = (CloudBlob)result.AsyncState;
    blob.EndUploadFromStream(result);
}
Running this, the EndUploadFromStream() method throws a WebException with the message: "The request was aborted: The request was canceled." The InnerException is "Cannot close stream until all bytes are written."
Anyone have any idea what's going on here?
BeginUploadFromStream uploads the blob asynchronously, so your method proceeds while the blob uploads on a thread in the background. If the blob hasn't finished uploading -- or if Azure hasn't been told that the upload has completed -- you won't see the blob in storage. Only blobs uploaded through successfully completed transactions are visible.
Could you post the code for OnUploadCompleted?
It looks at first glance as if either the blob is still uploading -- or you've forgotten to call EndUploadFromStream() in your OnUploadCompleted method.
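As a sketch of one way to pair the calls correctly under the old StorageClient API (the Task.Factory.FromAsync wrapper, the async context and the source stream are my assumptions, not the poster's code), keep the stream open until the end call has completed:
// Hypothetical wrapper around the Begin/End pair, inside an async method.
// The source stream must stay open until EndUploadFromStream has run,
// which awaiting the wrapped task guarantees; this also avoids the
// "Cannot close stream until all bytes are written" error.
using (Stream fileStream = File.OpenRead(path))   // 'path' is a placeholder
{
    await Task.Factory.FromAsync(
        blob.BeginUploadFromStream,               // begin method
        blob.EndUploadFromStream,                 // end method
        fileStream,                               // the stream to upload
        null);                                    // state object (unused here)
}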
What sounds like it's happening is that IIS is cancelling the thread that initiated the BeginUploadFromStream call. Since the storage API is really just a set of REST calls under the hood, you can think of these storage calls as web service calls rather than traditional IO.
Check out this topic on HTTP keep-alives; it might solve your problem, but as the article points out it may impact the performance of your site, so you may want to add logic that enables the keep-alive only for the requests that perform the upload.
http://www.jaxidian.org/update/2007/05/05/8/
