Azure WebJob not processing all Blobs

I upload gzipped files to an Azure Storage Container (input). I then have a WebJob that is supposed to pick up the Blobs, decompress them and drop them into another Container (output). Both containers use the same storage account.
My problem is that it doesn't process all Blobs. It always seems to miss 1. This morning I uploaded 11 blobs to the input Container and only 10 were processed and dumped into the output Container. If I upload 4 then 3 will be processed. The dashboard will show 10 invocations even though 11 blobs have been uploaded. It doesn't look like it gets triggered for the 11th blob. If I only upload 1 it seems to process it.
I am running the website in Standard Mode with Always On set to true.
I have tried:
Writing code like the Azure Samples (https://github.com/Azure/azure-webjobs-sdk-samples).
Writing code like the code in this article (http://azure.microsoft.com/en-us/documentation/articles/websites-dotnet-webjobs-sdk-get-started).
Using Streams for the input and output instead of CloudBlockBlobs.
Various combinations of closing the input, output and Gzip Streams.
Having the UnzipData code in the Unzip method.
This is my latest code. Am I doing something wrong?
using System.IO;
using System.IO.Compression;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage.Blob;

public class Functions
{
    // Triggered when a .gz blob lands in the "input" container;
    // writes the decompressed content to a blob of the same name in "output".
    public static void Unzip(
        [BlobTrigger("input/{name}.gz")] CloudBlockBlob inputBlob,
        [Blob("output/{name}")] CloudBlockBlob outputBlob)
    {
        using (Stream input = inputBlob.OpenRead())
        using (Stream output = outputBlob.OpenWrite())
        {
            UnzipData(input, output);
        }
    }

    public static void UnzipData(Stream input, Stream output)
    {
        using (var gzippedStream = new GZipStream(input, CompressionMode.Decompress))
        {
            gzippedStream.CopyTo(output);
        }
    }
}

As per Victor's comment above it looks like it is a bug on Microsoft's end.
Edit: I don't get the downvote. There is a problem and Microsoft are going to fix it. That is the answer to why some of my blobs are ignored...
"There is a known issue about some Storage log events being ignored. Those events are usually generated for large files. We have a fix for it but it is not public yet. Sorry for the inconvenience. – Victor Hurdugaci Jan 9 at 12:23"

Just as a workaround: what if you don't listen to the blob directly, but instead put a queue in between? When you write to the input blob container, also write a message about the new blob to the queue, and let the WebJob listen to that queue. Once a message arrives, the WebJob function reads the file from the input blob container and copies it into the output blob container.
Does this model work for you?
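If it helps, here is a minimal sketch of that idea with the WebJobs SDK. The queue name "input-ready" and the assumption that the queue message body is the blob name (without the ".gz" extension) are my own, not from your setup:

using System.IO;
using System.IO.Compression;
using Microsoft.Azure.WebJobs;

public class FunctionsViaQueue
{
    // Sketch only: assumes the uploader enqueues the blob name (without ".gz")
    // to a queue named "input-ready" right after uploading to the "input" container.
    public static void Unzip(
        [QueueTrigger("input-ready")] string name,
        [Blob("input/{queueTrigger}.gz", FileAccess.Read)] Stream input,
        [Blob("output/{queueTrigger}", FileAccess.Write)] Stream output)
    {
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        {
            gzip.CopyTo(output);
        }
    }
}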

Related

ASP.NET Core Higher memory use uploading files to Azure Blob Storage SDK using v12 compared to v11

I am building a service with an endpoint that images and other files will be uploaded to, and I need to stream the files directly to Blob Storage. This service will handle hundreds of images per second, so I cannot buffer the images in memory before sending them to Blob Storage.
I was following the article here and ran into this comment:
Next, using the latest version (v12) of the Azure Blob Storage libraries and a Stream upload method. Notice that it’s not much better than IFormFile! Although BlobStorageClient is the latest way to interact with blob storage, when I look at the memory snapshots of this operation it has internal buffers (at least, at the time of this writing) that cause it to not perform too well when used in this way.
But, using almost identical code and the previous library version that uses CloudBlockBlob instead of BlobClient, we can see a much better memory performance. The same file uploads result in a small increase (due to resource consumption that eventually goes back down with garbage collection), but nothing near the ~600MB consumption like above
I tried this and found that yes, v11 has considerably less memory usage compared to v12! When I ran my tests with a ~10MB file, each new upload (after the initial POST) on v12 jumped the memory usage by about 40MB, while v11 jumped by only 20MB.
I then tried a 100MB file. On v12, memory jumped by roughly 100MB almost instantly on each request and slowly climbed after that, passing 700MB after my second upload. Meanwhile, v11 didn't really jump in memory, though it would still climb slowly, ending at around 430MB after the second upload.
I tried experimenting with the BlobUploadOptions properties (InitialTransferSize, MaximumConcurrency, etc.), but it only seemed to make things worse.
It seems unlikely that v12 would be straight up worse in performance than v11, so I am wondering what I could be missing or misunderstanding.
Thanks!
Sometimes this issue can occur with the Azure Blob Storage (v12) libraries.
Try uploading large files in chunks (a technique called file chunking, which breaks a large file into smaller chunks and uploads each one separately) instead of uploading the whole file in one call. Please refer to this link.
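As an illustration only, here is a minimal sketch of chunked uploading with the v12 SDK using BlockBlobClient.StageBlock/CommitBlockList; the 4 MB block size and the class/method names are my own assumptions, not part of the original code:

using System;
using System.Collections.Generic;
using System.IO;
using Azure.Storage.Blobs.Specialized;

public static class ChunkedUpload
{
    // Sketch: stage fixed-size blocks from the source stream, then commit the block list.
    // Only one block-sized buffer is held in memory at a time.
    public static void UploadInChunks(BlockBlobClient blob, Stream source, int blockSize = 4 * 1024 * 1024)
    {
        var blockIds = new List<string>();
        var buffer = new byte[blockSize];
        int read;
        int blockNumber = 0;

        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Block IDs must be base64 strings of equal length within a blob.
            string blockId = Convert.ToBase64String(BitConverter.GetBytes(blockNumber++));
            using (var chunk = new MemoryStream(buffer, 0, read))
            {
                blob.StageBlock(blockId, chunk);
            }
            blockIds.Add(blockId);
        }

        blob.CommitBlockList(blockIds);
    }
}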
I tried reproducing the scenario in my lab:
public void uploadfile()
{
    string connectionString = "connection string";
    string containerName = "fileuploaded";
    string blobName = "test";
    string filePath = "filepath";

    BlobContainerClient container = new BlobContainerClient(connectionString, containerName);
    container.CreateIfNotExists();

    // Get a reference to the blob in the container
    BlobClient blob = container.GetBlobClient(blobName);

    // Upload the local file
    blob.Upload(filePath);
}
The output after uploading the file:

Azure Function App copy blob from one container to another using startCopy in java

I am using Java to write an Azure Function App with an Event Grid trigger for the BlobCreated event. Whenever a blob is created, the function is triggered and copies the blob from one container to another. I am using the startCopy function from com.microsoft.azure.storage.blob. It was working fine, but sometimes it copies files of zero bytes even though they actually contain data at the source location, so the destination sometimes ends up with zero-byte files. I would like a little help with this so I can understand how to handle this situation.
// Source blob (from the container that raised the event)
CloudBlockBlob cloudBlockBlob = container.getBlockBlobReference(blobFileName);

// Destination account/container; reuse the source blob name at the destination
CloudStorageAccount storageAccountdest = CloudStorageAccount.parse("something");
CloudBlobClient blobClientdest = storageAccountdest.createCloudBlobClient();
CloudBlobContainer destcontainer = blobClientdest.getContainerReference("something");
CloudBlockBlob destcloudBlockBlob = destcontainer.getBlockBlobReference(blobFileName);

// Start a server-side (asynchronous) copy
destcloudBlockBlob.startCopy(cloudBlockBlob);
Copying blobs across storage accounts is an async operation. When you call the startCopy method, it just signals Azure Storage to copy a file. The actual copy operation happens asynchronously and may take some time depending on how large a file you're transferring.
I would suggest that you check the progress of the copy operation on the target blob to see how many bytes have been copied and whether the copy has failed. You can do so by fetching the properties of the target blob. A copy operation could potentially fail if the source blob is modified after Azure Storage has started the copy.
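For illustration, here is a rough C# sketch (classic storage SDK) of polling the copy state on the destination blob; the Java client exposes equivalent methods (downloadAttributes() and getCopyState()), and the helper name and polling interval here are my own assumptions:

using System.Threading;
using Microsoft.WindowsAzure.Storage.Blob;

public static class CopyMonitor
{
    // Sketch: poll the destination blob until the server-side copy finishes or fails.
    public static void WaitForCopy(CloudBlockBlob destination)
    {
        destination.FetchAttributes();
        while (destination.CopyState.Status == CopyStatus.Pending)
        {
            Thread.Sleep(500);
            destination.FetchAttributes();
        }

        if (destination.CopyState.Status != CopyStatus.Success)
        {
            throw new System.InvalidOperationException(
                "Copy failed: " + destination.CopyState.StatusDescription);
        }
    }
}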
I had the same problem, and later figured it out from the docs:
"Event Grid isn't a data pipeline, and doesn't deliver the actual object that was updated."
Event Grid will tell you that something has changed, and the actual message has a size limit; as long as the data you are copying is within that limit the copy will succeed, otherwise the result is 0 bytes. I was able to copy up to 1MB, and beyond that it resulted in 0 bytes. You can check whether Azure has increased the size limit recently.
However, if you want to copy the complete data then you need to use Event Hubs or Service Bus. For mine, I went with Service Bus.

Azure WebJob: BlobTrigger vs QueueTrigger resource usage

I created a WebJob project to back up images from Azure to Amazon, using the BlobTrigger attribute for the first parameter of the method:
public static async Task CopyImage([BlobTrigger("images/{name}")] ICloudBlob image, string name, TextWriter log)
{
    var imageStream = new MemoryStream();
    image.DownloadToStream(imageStream);
    await S3ImageBackupContext.UploadImageAsync(name, imageStream);
}
Then I read in the document How to use Azure blob storage that the BlobTrigger works on a best-effort basis, so I changed it to a QueueTrigger.
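A minimal sketch of what the QueueTrigger variant looks like, continuing the snippet above; the queue name "image-backups" and the assumption that the message body is the blob name are illustrative, not the exact code deployed:

// Sketch only: assumes a queue named "image-backups" whose messages are blob names.
public static async Task CopyImageFromQueue(
    [QueueTrigger("image-backups")] string name,
    [Blob("images/{queueTrigger}", FileAccess.Read)] Stream image,
    TextWriter log)
{
    var imageStream = new MemoryStream();
    await image.CopyToAsync(imageStream);
    imageStream.Position = 0; // rewind before handing the buffer to the uploader
    await S3ImageBackupContext.UploadImageAsync(name, imageStream);
}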
Both work perfectly fine :-) so it's not a problem but a question. Since I deployed the change, the CPU and memory usage of the WebJob looks like this:
Can somebody explain to me the reason for the drop in memory and CPU usage? The data egress also went down.
Very interesting.
I think you're the only one who can answer that question.
Do a remote profile for both blob and queue versions, see which method eats up that CPU time:
https://azure.microsoft.com/en-us/blog/remote-profiling-support-in-azure-app-service/
For memory consumption you probably need to grab a memory dump:
https://blogs.msdn.microsoft.com/waws/2015/07/01/create-a-memory-dump-for-your-slow-performing-web-app/

Setup webjob ServiceBusTriggers or queue names at runtime (without hard-coded attributes)?

Is there any way to configure triggers without attributes? I cannot know the queue names ahead of time.
Let me explain my scenario here.. I have one service bus queue, and for various reasons (complicated duplicate-suppression business logic), the queue messages have to be processed one at a time, so I have ServiceBusConfiguration.OnMessageOptions.MaxConcurrentCalls set to 1. So processing a message holds up the whole queue until it is finished. Needless to say, this is suboptimal.
This 'one at a time' policy isn't so simple. The messages could be processed in parallel, they just have to be divided into groups (based on a field in message), say A and B. Group A can process its messages one at a time, and group B can process its own one at a time, etc. A and B are processed in parallel, all is good.
So I can create a queue for each group, A, B, C, ... etc. There are about 50 groups, so 50 queues.
I can create a queue for each, but how to make this work with the Azure Webjobs SDK? I don't want to copy-paste a method for each queue with a different ServiceBusTrigger for the SDK to discover, just to enforce one-at-a-time per queue/group, then update the code with another copy-paste whenever another group is needed. Fetching a list of queues at startup and tying to the function is preferable.
I have looked around and I don't see any way to do what I want. The ITypeLocator interface is pretty hard-set to look for attributes. I could probably abuse the INameResolver, but it seems like I'd still have to have a bunch of near-duplicate methods around. Could I somehow create what the SDK is looking for at startup/runtime?
(To be clear, I know how to use INameResolver to get a queue name, as in How to set Azure WebJob queue name at runtime?, but though similar, that isn't my problem. I want to set up triggers for multiple queues at startup for the same function to get the one-at-a-time-per-queue processing, without repeating the trigger attribute 50 times. I figured I'd ask again since the SDK repo is fairly active and it's been a year.)
Or am I going about this all wrong? Being dumb? Missing something? Any advice on this dilemma would be welcome.
The Azure WebJobs host discovers and indexes the functions with the ServiceBusTrigger attribute when it starts, so there is no way to set up queue triggers at runtime.
The simpler solution for you is to create a long-running job and implement the message handling manually:
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.ServiceBus.Messaging;

public class Program
{
    private static void Main()
    {
        var host = new JobHost();
        host.CallAsync(typeof(Program).GetMethod("Process"));
        host.RunAndBlock();
    }

    [NoAutomaticTrigger]
    public static async Task Process(TextWriter log, CancellationToken token)
    {
        var connectionString = "myconnectionstring";
        // You can also get the queue names from app settings or an Azure table
        var queueNames = new[] { "queueA", "queueB" };
        var messagingFactory = MessagingFactory.CreateFromConnectionString(connectionString);

        foreach (var queueName in queueNames)
        {
            var receiver = messagingFactory.CreateMessageReceiver(queueName);
            receiver.OnMessage(message =>
            {
                try
                {
                    // do something
                    // ...

                    // Complete the message
                    message.Complete();
                }
                catch (Exception ex)
                {
                    // Log the error
                    log.WriteLine(ex.ToString());
                    // Abandon the message so that it can be retried
                    message.Abandon();
                }
            }, new OnMessageOptions { MaxConcurrentCalls = 1 });
        }

        // Wait until the job stops or restarts
        await Task.Delay(Timeout.InfiniteTimeSpan, token);
    }
}
Otherwise, if you don't want to deal with multiple queues, you can have a look at Azure Service Bus topics/subscriptions and create a SqlFilter to route each message to the right subscription.
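As a rough sketch of that idea (the topic path, subscription names, and the "Group" property are assumptions), creating one filtered subscription per group with the classic WindowsAzure.ServiceBus library could look like this:

using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

public static class TopicSetup
{
    // Sketch: one subscription per group, filtered on a "Group" message property.
    public static void CreateGroupSubscriptions(string connectionString, string topicPath, string[] groups)
    {
        var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

        foreach (var group in groups)
        {
            if (!namespaceManager.SubscriptionExists(topicPath, group))
            {
                namespaceManager.CreateSubscription(topicPath, group,
                    new SqlFilter("Group = '" + group + "'"));
            }
        }
    }
}

The sender would then set message.Properties["Group"] on each BrokeredMessage so the filter can route it to the matching subscription.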
Another option could be to create your own trigger: the Azure WebJobs SDK provides extensibility points to create your own trigger binding:
Binding Extensions Overview
Good luck!
Based on my understanding, your need seems to be processing batches of messages in parallel. @Thomas's solution is good, but I think the Azure Batch service with Table storage may be a better fit and could replace the more complex solution of Service Bus queues + WebJobs with a trigger.
Using Azure Batch with Table storage, you can control task creation, execute the tasks in parallel and at scale, and even monitor these tasks; please refer to the tutorial to learn how.

Download of large blob from Azure Blob Storage was frozen

We use Azure Blob Storage as storage for large files (~5-20GB).
Our customer reported a problem: the download of such files sometimes stalls and never completes.
I logged download statistics and tried to download the problem file several times. One of the attempts was unsuccessful, and now I have a chart of downloaded_size/time:
There were short pauses in downloading from 16:00 to 16:12. The intervals between the pauses are identical, but their length grows. At 16:12 the speed dropped to KB/s levels and never returned to normal.
Here is the code that performs the download (.NET 4.0):
CloudBlobContainer container = new CloudBlobContainer(new Uri(containerSAS));
CloudBlockBlob blockBlob = container.GetBlockBlobReference(blobName);

var options = new BlobRequestOptions()
{
    ServerTimeout = new TimeSpan(10, 0, 0, 0),
    RetryPolicy = new LinearRetry(TimeSpan.FromMinutes(2), 100),
};

blockBlob.DownloadToStream(outputStream, null, options);
What could be the reason for such problems?
EDIT: To get the statistics I used the following Stream implementation:
public class TestControlledFileStream : Stream
{
    private StreamWriter _Writer;
    private long _Size;

    public TestControlledFileStream(string filename)
    {
        this._Writer = new StreamWriter(filename);
    }

    // Log the running total and chunk size of every write instead of storing the data.
    public override void Write(byte[] buffer, int offset, int count)
    {
        _Size += count;
        _Writer.WriteLine("{0}: ({1}, {2})", DateTime.UtcNow, _Size, count);
    }

    protected override void Dispose(bool disposing)
    {
        if (this._Writer != null)
            this._Writer.Dispose();
    }

    // Minimal overrides for the remaining abstract Stream members (only Write is used here).
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return _Size; } }
    public override long Position { get { return _Size; } set { throw new NotSupportedException(); } }
    public override void Flush() { _Writer.Flush(); }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
It looks like you are having issues downloading a large blob. I have provided some steps below to help you debug these kinds of issues.
Quick comment: You don’t need to specify the linear retry policy for this download. The default exponential retry policy should suffice for your scenario. By setting the number of allowed failures for this retry policy to 100 you may be prolonging a problem that retrying won’t fix. Is there a reason you chose to use linear retry?
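For reference, a minimal sketch of relying on exponential retry instead, reusing blockBlob and outputStream from the question (the back-off delta and attempt count are illustrative values, not a recommendation from this answer):

var options = new BlobRequestOptions()
{
    // Exponential back-off: ~4 second base delay, up to 5 attempts (example values only).
    RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(4), 5),
    ServerTimeout = TimeSpan.FromMinutes(30)
};

blockBlob.DownloadToStream(outputStream, null, options);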
Debugging Azure Storage Issues
Storage Analytics logs are a useful first step in debugging, as they give you information about your requests such as server latency or response status code. You can check what info is available here and learn how to use logging here.
Client-side logging provides client-specific information that can help you narrow down your issue.
Using Fiddler on your client can provide visibility into the state of your download, such as whether the client is retrying or not. Note that this may obscure speed or connection issues, as the traffic will run through a proxy.
If you can get more detailed information regarding this issue, we will be able to help mitigate it.
