Uploading files to Azure Blob Storage takes more time for larger files - azure

Hi all,
I am trying to upload larger files (more than 100 MB) to Azure Blob Storage. Below is the code.
My problem is that even though I have used BeginPutBlock with TPL (Task Parallelism), the upload still takes a long time (20 minutes for 100 MB), and I have to upload files larger than 2 GB. Can anyone please help me with this?
namespace BlobSamples
{
    public class UploadAsync
    {
        static void Main(string[] args)
        {
            //string filePath = @"D:\Frameworks\DNCMag-Issue26-DoubleSpread.pdf";
            string filePath = @"E:\E Books\NageswaraRao Meterial\ebooks\applied_asp.net_4_in_context.pdf";
            string accountName = "{account name}";
            string accountKey = "{account key}";
            string containerName = "sampleContainer";
            string blobName = Path.GetFileName(filePath);
            //byte[] fileContent = File.ReadAllBytes(filePath);
            Stream fileContent = System.IO.File.OpenRead(filePath);
            StorageCredentials creds = new StorageCredentials(accountName, accountKey);
            CloudStorageAccount storageAccount = new CloudStorageAccount(creds, useHttps: true);
            CloudBlobClient blobclient = storageAccount.CreateCloudBlobClient();
            CloudBlobContainer container = blobclient.GetContainerReference(containerName);
            CloudBlockBlob blob = container.GetBlockBlobReference(blobName);

            // Define your retry strategy: retry 5 times, starting 1 second apart
            // and adding 2 seconds to the interval each retry.
            var retryStrategy = new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2));

            // Define your retry policy using the retry strategy and the Azure storage
            // transient fault detection strategy.
            var retryPolicy = new RetryPolicy<StorageTransientErrorDetectionStrategy>(retryStrategy);

            // Receive notifications about retries.
            retryPolicy.Retrying += (sender, arg) =>
            {
                // Log details of the retry.
                var msg = String.Format("Retry - Count:{0}, Delay:{1}, Exception:{2}",
                    arg.CurrentRetryCount, arg.Delay, arg.LastException);
            };

            Console.WriteLine("Upload Started " + DateTime.Now);
            ChunkedUploadStreamAsync(blob, fileContent, (1024 * 1024), retryPolicy);
            Console.WriteLine("Upload Ended " + DateTime.Now);
            Console.ReadLine();
        }

        private static Task PutBlockAsync(CloudBlockBlob blob, string id, Stream stream, RetryPolicy policy)
        {
            Func<Task> uploadTaskFunc = () => Task.Factory
                .FromAsync(
                    (asyncCallback, state) => blob.BeginPutBlock(id, stream, null, null, null, null, asyncCallback, state),
                    blob.EndPutBlock,
                    null
                );
            Console.WriteLine("Uploaded " + id + DateTime.Now);
            return policy.ExecuteAsync(uploadTaskFunc);
        }

        public static Task ChunkedUploadStreamAsync(CloudBlockBlob blob, Stream source, int chunkSize, RetryPolicy policy)
        {
            var blockids = new List<string>();
            var blockid = 0;
            int count;
            // first create a list of TPL Tasks for uploading blocks asynchronously
            var tasks = new List<Task>();
            var bytes = new byte[chunkSize];
            while ((count = source.Read(bytes, 0, bytes.Length)) != 0)
            {
                var id = Convert.ToBase64String(BitConverter.GetBytes(++blockid));
                blockids.Add(id);
                tasks.Add(PutBlockAsync(blob, id, new MemoryStream(bytes, true), policy));
                bytes = new byte[chunkSize]; // need a new buffer to avoid overwriting the previous one
            }
            return Task.Factory.ContinueWhenAll(
                tasks.ToArray(),
                array =>
                {
                    // propagate exceptions and mark all faulted Tasks as observed
                    Task.WaitAll(array);
                    policy.ExecuteAction(() => blob.PutBlockListAsync(blockids));
                    Console.WriteLine("Upload Completed " + DateTime.Now);
                });
        }
    }
}

If a command-line tool is acceptable, you can try AzCopy, which transfers Azure Storage data with high performance and supports resumable transfers.
If you want to control the transfer jobs programmatically, use the Azure Storage Data Movement Library, which is the core of AzCopy.
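For example, a minimal Data Movement Library sketch might look like the following (the file path, container, and blob names are placeholders, not from the original answer; depending on the package version the root namespace is Microsoft.WindowsAzure.Storage or Microsoft.Azure.Storage):

using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.DataMovement;

class DataMovementSample
{
    static async Task Main()
    {
        CloudStorageAccount account = CloudStorageAccount.Parse("{connection string}");
        CloudBlockBlob destBlob = account.CreateCloudBlobClient()
                                         .GetContainerReference("sampleContainer")
                                         .GetBlockBlobReference("largefile.bin");

        // Let the library split the file into blocks and upload them concurrently.
        TransferManager.Configurations.ParallelOperations = 64;

        await TransferManager.UploadAsync(@"D:\largefile.bin", destBlob);
        Console.WriteLine("Upload finished " + DateTime.Now);
    }
}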

As far as I know, block blobs are made up of blocks, and a block can be up to 4 MB in size. According to your code, you set the block size to 1 MB and upload each block in parallel yourself. A simpler way is to use the ParallelOperationThreadCount property and let the client library upload the blocks in parallel, as follows:
// set the number of blocks that may be uploaded simultaneously
var requestOption = new BlobRequestOptions()
{
    ParallelOperationThreadCount = 5,
    // Gets or sets the maximum size, in bytes, of a blob that may be uploaded as a single blob
    SingleBlobUploadThresholdInBytes = 10 * 1024 * 1024 // maximum 64 MB, 32 MB by default
};

// upload a file to the blob
blob.UploadFromFile("{filepath}", options: requestOption);
With this option, when your blob (file) is larger than the value of SingleBlobUploadThresholdInBytes, the storage client automatically breaks the file into blocks (4 MB in size) and uploads the blocks simultaneously.
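If you prefer to keep the upload asynchronous, a rough equivalent sketch (reusing the blob and filePath variables from the question's code; the 4 MB block size is my assumption, not from the answer) could be:

// (inside an async method)
// Upload in 4 MB blocks, several blocks at a time; the client library does the chunking.
blob.StreamWriteSizeInBytes = 4 * 1024 * 1024;

var requestOption = new BlobRequestOptions()
{
    ParallelOperationThreadCount = Environment.ProcessorCount,
    SingleBlobUploadThresholdInBytes = 4 * 1024 * 1024 // anything larger is split into blocks
};

await blob.UploadFromFileAsync(filePath, null, requestOption, null);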
Based on your requirement, I created an ASP.NET Web API application that exposes an API for uploading files to Azure Blob Storage.
Project URL: AspDotNet-WebApi-AzureBlobFileUploadSample
Note:
In order to upload large files, you need to increase maxRequestLength and maxAllowedContentLength in your web.config as follows:
<system.web>
  <httpRuntime maxRequestLength="2097152"/> <!-- in KB; 4 MB by default, increased here to 2 GB -->
</system.web>
<system.webServer>
  <security>
    <requestFiltering>
      <requestLimits maxAllowedContentLength="2147483648" /> <!-- in bytes; increased here to 2 GB -->
    </requestFiltering>
  </security>
</system.webServer>

I'd suggest you use AzCopy when uploading large files; it saves a lot of time compared to coding it yourself and is more efficient. To upload a single file, run the command below:
AzCopy /Source:C:\folder /Dest:https://youraccount.blob.core.windows.net/container /DestKey:key /Pattern:"test.txt"

Related

Azure Blob storage append blob 409 / modified error when refreshing portal / storage explorer whilst appending

I am receiving an error while uploading chunks to an append blob in Azure. Left alone, the process works fine, but the problem arises the moment I refresh the container with Storage Explorer (latest version) or refresh the page in the Azure portal while it's uploading. My process throws the following.
An exception of type 'Azure.RequestFailedException' occurred in System.Private.CoreLib.dll but was not handled in user code: 'The blob has been modified while being read.
RequestId:62778adb-001e-011e-29a4-d589bf000000
Time:2021-11-09T20:02:29.3183234Z
Status: 409 (The blob has been modified while being read.)
ErrorCode: BlobModifiedWhileReading
Taking a lease out on the file makes no difference.
Test code is
using System;
using System.IO;
using System.Buffers;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using System.Text;
using Azure.Storage.Blobs.Specialized;

namespace str
{
    static class Program
    {
        static async Task Main(string[] args)
        {
            const string ContainerName = "files";
            const string BlobName = "my.blob";
            const int ChunkSize = 4194304; // 4MB
            const string connstr = "some-connecting-string-to-your-datalake-gen-2-account";

            BlobServiceClient blobClient = new(connstr);
            BlobContainerClient containerClient = blobClient.GetBlobContainerClient(ContainerName);
            await containerClient.CreateIfNotExistsAsync();

            AppendBlobClient appendClient = containerClient.GetAppendBlobClient(BlobName);
            await appendClient.CreateIfNotExistsAsync();

            using FileStream fs = await FileMaker.CreateNonsenseFileAsync();
            using BinaryReader reader = new(fs);
            bool readLoop = true;
            while (readLoop)
            {
                byte[] chunk = reader.ReadBytes(ChunkSize);
                if (chunk.Length > 0)
                    await appendClient.AppendBlockAsync(new MemoryStream(chunk));
                readLoop = chunk.Length == ChunkSize;
            }
            fs.Close();
            File.Delete(fs.Name);
        }
    }

    public static class FileMaker
    {
        public static async Task<FileStream> CreateNonsenseFileAsync(int blocks = 30000)
        {
            string tempFile = Path.GetTempFileName();
            FileStream fs = File.OpenWrite(tempFile);
            byte[] buffer = ArrayPool<byte>.Shared.Rent(1024);
            Random randy = new();
            using (StreamWriter writer = new(fs))
            {
                for (int i = 0; i < blocks; i++)
                {
                    randy.NextBytes(buffer);
                    await writer.WriteAsync(Encoding.UTF8.GetString(buffer));
                }
            }
            return File.OpenRead(tempFile);
        }
    }
}
csproj is
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net5.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Azure.Storage.Files.DataLake" Version="12.8.0" />
    <PackageReference Include="System.Buffers" Version="4.5.1" />
  </ItemGroup>
</Project>
I can only imagine that the act of refreshing the page somehow triggers a metadata change and the service gives up, but that seems fairly arbitrary, since you may not know who is poking around in blob storage while you are uploading to it.
As stated above, left alone with no one refreshing the page in the portal or Storage Explorer, this code works fine and uploads the garbage in 4 MB chunks (the limit for append blob writes; switching to 2 MB chunks doesn't make a difference) without a problem.
I tried to reproduce the scenario on my system and am not facing the issue you describe. I am able to get the appended data after refreshing the portal.
Output: (screenshots showing the blob before and after refreshing the portal)

.NET Core: Reading Azure Storage Blob into Memory Stream throws NotSupportedException in HttpBaseStream

I want to download a storage blob from Azure and stream it to a client via a .NET web app. The blob was uploaded correctly and is visible in my Azure storage account.
Surprisingly, the following throws an exception within HttpBaseStream:
[...]
var blobClient = _containerClient.GetBlobClient(Path.Combine(fileName));
var stream = await blobClient.OpenReadAsync();
return stream;
-> When I step further and return a File (return File(stream, MediaTypeNames.Application.Octet);), the download works as intended.
I tried to push the stream into a MemoryStream, which also fails with the same exception:
[...]
var blobClient = _containerClient.GetBlobClient(Path.Combine(fileName));
var stream = new MemoryStream();
await blobClient.DownloadToAsync(stream);
return stream;
-> When I step further, returning the file results in a timeout.
How can I fix that? Why do I get this exception? I followed the official quickstart guide from Microsoft.
the following throws an exception within HttpBaseStream
It looks like the HTTP result type is attempting to set the Content-Length header and is reading Length to do so. That would be the natural thing to do. However, it would also be natural to handle the NotSupportedException and just not set Content-Length at all.
If the NotSupportedException only shows up when running in the debugger, then just ignore it.
If the exception is actually thrown to your code (i.e., causing the request to fail), then you'll need to follow the rest of this answer.
First, create a minimal reproducible example and report a bug to the .NET team.
To work around this issue in the meantime, I recommend writing a stream wrapper that returns an already-determined length, which you can get from the Azure blob attributes. E.g.:
public sealed class KnownLengthStreamWrapper : Stream
{
    private readonly Stream _stream;

    public KnownLengthStreamWrapper(Stream stream, long length)
    {
        _stream = stream;
        Length = length;
    }

    public override long Length { get; private set; }

    ... // override all other Stream members and forward to _stream.
}
That should be sufficient to get your app working.
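As a rough usage sketch (assuming the v12 BlobClient and _containerClient from the question; the controller action shape is illustrative, not from the original answer):

public async Task<IActionResult> DownloadAsync(string fileName)
{
    var blobClient = _containerClient.GetBlobClient(fileName);

    // The length comes from the blob's stored properties, so the wrapper can report it
    // even though the underlying HTTP response stream does not support Length.
    Azure.Storage.Blobs.Models.BlobProperties props = await blobClient.GetPropertiesAsync();
    Stream body = await blobClient.OpenReadAsync();

    return File(new KnownLengthStreamWrapper(body, props.ContentLength),
                System.Net.Mime.MediaTypeNames.Application.Octet);
}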
I tried to push the stream into a MemoryStream
This didn't work because you'd need to "rewind" the MemoryStream at some point, e.g.:
var stream = new MemoryStream();
await blobClient.DownloadToAsync(stream);
stream.Position = 0;
return stream;
Check this sample of all the blob options, which I have already posted on GitHub and which works as expected. Reference
public void DownloadBlob(string path)
{
    storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
    CloudBlobClient client = storageAccount.CreateCloudBlobClient();
    CloudBlobContainer container = client.GetContainerReference("images");
    CloudBlockBlob blockBlob = container.GetBlockBlobReference(Path.GetFileName(path));
    using (MemoryStream ms = new MemoryStream())
    {
        blockBlob.DownloadToStream(ms);
        HttpContext.Current.Response.ContentType = blockBlob.Properties.ContentType.ToString();
        HttpContext.Current.Response.AddHeader("Content-Disposition", "Attachment; filename=" + Path.GetFileName(path).ToString());
        HttpContext.Current.Response.AddHeader("Content-Length", blockBlob.Properties.Length.ToString());
        HttpContext.Current.Response.BinaryWrite(ms.ToArray());
        HttpContext.Current.Response.Flush();
        HttpContext.Current.Response.Close();
    }
}

Azure Storage Queue performance

We are migrating a transaction-processing service, which used to process messages from MSMQ and store transactions in a SQL Server database, to the Azure Storage Queue (storing the IDs of the messages in the queue and placing the actual messages in Azure Blob Storage).
We should be able to process at least 200,000 messages per hour, but at the moment we barely reach 50,000 messages per hour.
Our application requests batches of 250 messages from the queue (which now takes about 2 seconds to get the IDs from the Azure queue and about 5 seconds to get the actual data from Azure Blob Storage), and we store this data in one go into the database using a stored procedure that accepts a datatable.
Our service also resides in Azure on a virtual machine, and we use the NuGet libraries Azure.Storage.Queues and Azure.Storage.Blobs suggested by Microsoft to access the Azure Storage queue and blob storage.
Does anyone have suggestions how to improve the speed of reading messages from the Azure Queue and then retrieving the data from the Azure Blob?
var managedIdentity = new ManagedIdentityCredential();

UriBuilder fullUri = new UriBuilder()
{
    Scheme = "https",
    Host = string.Format("{0}.queue.core.windows.net", appSettings.StorageAccount),
    Path = string.Format("{0}", appSettings.QueueName),
};
queue = new QueueClient(fullUri.Uri, managedIdentity);
queue.CreateIfNotExists();
...
var result = await queue.ReceiveMessagesAsync(1);
...
UriBuilder fullUri = new UriBuilder()
{
    Scheme = "https",
    Host = string.Format("{0}.blob.core.windows.net", storageAccount),
    Path = string.Format("{0}", containerName),
};
_blobContainerClient = new BlobContainerClient(fullUri.Uri, managedIdentity);
_blobContainerClient.CreateIfNotExists();
...
public async Task<BlobMessage> GetBlobByNameAsync(string blobName)
{
    Ensure.That(blobName).IsNotNullOrEmpty();
    var blobClient = _blobContainerClient.GetBlobClient(blobName);
    if (!blobClient.Exists())
    {
        _log.Error($"Blob {blobName} not found.");
        throw new InfrastructureException($"Blob {blobName} not found.");
    }
    BlobDownloadInfo download = await blobClient.DownloadAsync();
    return new BlobMessage
    {
        BlobName = blobClient.Name,
        BaseStream = download.Content,
        Content = await GetBlobContentAsync(download)
    };
}
Thanks,
Vincent.
Based on the code you posted, I can suggest two improvements (a sketch follows below):
Receive 32 messages at a time instead of 1: Currently you're getting just one message at a time (var result = await queue.ReceiveMessagesAsync(1);). You can receive a maximum of 32 messages from the top of the queue. Just change the code to var result = await queue.ReceiveMessagesAsync(32); to get 32 messages. This will save you 31 round trips to the storage service and should lead to some performance improvement.
Don't try to create the blob container every time: Currently you're trying to create the blob container every time you process a message (_blobContainerClient.CreateIfNotExists();). That is really unnecessary. Fetching 32 messages already reduces this call by a factor of 31, but you can simply move it to your application startup so that it runs only once during the application's lifetime.
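A minimal sketch of both suggestions together (the ProcessMessageAsync helper is hypothetical; queue is the QueueClient from the question, created once at startup together with the blob container):

// using Azure.Storage.Queues.Models;
private async Task ProcessBatchAsync()
{
    // Pull up to 32 messages in a single round trip.
    QueueMessage[] messages = await queue.ReceiveMessagesAsync(maxMessages: 32);

    foreach (QueueMessage message in messages)
    {
        await ProcessMessageAsync(message);   // hypothetical: fetch the blob and store the transaction

        // Delete only after successful processing so failed messages become visible again.
        await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
    }
}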

Is it possible to use more than 1.5 Gb of memory with an Azure Function App V2

I'm currently using v2 of Azure Function Apps. I've set the environment to 64 bit and am compiling against .NET Standard 2.0. host.json specifies version 2.
I'm reading in a .csv, and it works fine for smaller files. But when I read a 180 MB .csv into a List of string[], memory balloons to over 1 GB on read, and when I try to parse it, it goes over 2 GB and then throws an 'Out of Memory' exception. Even running on an App Service plan with more than 3.5 GB of memory hasn't solved the issue.
Edit:
I'm using this:
Uri blobUri = AppendSasOnUri(blobName);
_webClient = new WebClient();
Stream sourceStream = _webClient.OpenRead(blobUri);
_reader = new StreamReader(sourceStream);
However, since it's a CSV, I'm splitting out entire columns of data, so it's pretty hard to get away from this:
internal async Task<List<string[]>> ReadCsvAsync()
{
    while (!_reader.EndOfStream)
    {
        string[] currentCsvRow = await ReadCsvRowAsync();
        _fullBlobCsv.Add(currentCsvRow);
    }
    return _fullBlobCsv;
}
The goal is to store JSON into a blob when all's said and done.
Try using a stream (StreamReader) to read the input .csv file and process one line at a time.
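For the CSV case that could look roughly like this (blockBlob stands for the input blob and ProcessRow is a hypothetical per-row handler, e.g. one that builds the JSON output incrementally):

using (var reader = new StreamReader(await blockBlob.OpenReadAsync()))
{
    while (!reader.EndOfStream)
    {
        string line = await reader.ReadLineAsync();
        string[] columns = line.Split(',');   // naive split; a real CSV parser should handle quoting
        ProcessRow(columns);                  // handle one row at a time instead of buffering them all
    }
}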
I'm able to parse 300 MB files on the Consumption plan using streams. My use case may not be the same, but it's similar: parse a large concatenated PDF file, separate it into 5000+ smaller files, and store the separated files in a blob container. Below is my code for reference.
For your use case, you may want to use CloudAppendBlob instead of CloudBlockBlob if you're pushing all the parsed data into a single blob (a small sketch follows after the code).
public async static void ExtractSmallerFiles(CloudBlockBlob myBlob, string fileDate, ILogger log)
{
    using (var reader = new StreamReader(await myBlob.OpenReadAsync()))
    {
        CloudBlockBlob blockBlob = null;
        var fileContents = new StringBuilder(string.Empty);
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            if (line.StartsWith("%%MS_SKEY_0000_000_PDF:"))
            {
                var matches = Regex.Match(line, @"%%MS_SKEY_0000_000_PDF: A(\d+)_SMFL_B1234_D(\d{8})_A\d+_M(\d{15}) _N\d+");
                var smallFileDate = matches.Groups[2];
                var accountNumber = matches.Groups[3];
                var fileName = $"SmallerFiles/{smallFileDate}/{accountNumber}.pdf";
                blockBlob = myBlob.Container.GetBlockBlobReference(fileName);
            }
            fileContents.AppendLine(line);
            if (line.Equals("%%EOF"))
            {
                log.LogInformation($"Uploading {fileContents.Length} bytes to {blockBlob.Name}");
                await blockBlob.UploadTextAsync(fileContents.ToString());
                fileContents = new StringBuilder(string.Empty);
            }
        }
        await myBlob.DeleteAsync();
        log.LogInformation("Extracted Smaller files");
    }
}
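And a small sketch of the CloudAppendBlob variant mentioned above (blob name assumed), for the case where everything should end up in one blob:

CloudAppendBlob appendBlob = myBlob.Container.GetAppendBlobReference("SmallerFiles/combined-output.txt");
await appendBlob.CreateOrReplaceAsync();                     // create the append blob once
await appendBlob.AppendTextAsync(fileContents.ToString());   // append each parsed chunk as you go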

Getting an error when uploading a file to Azure Storage

I'm converting a website from a standard ASP.NET website over to use Azure. The website had previously taken an Excel file uploaded by an administrative user and saved it on the file system. As part of the migration, I'm saving this file to Azure Storage. It works fine when running against my local storage through the Azure SDK. (I'm using version 1.3 since I didn't want to upgrade during the development process.)
When I point the code to run against Azure Storage itself, though, the process usually fails. The error I get is:
System.IO.IOException occurred
Message=Unable to read data from the transport connection: The connection was closed.
Source=Microsoft.WindowsAzure.StorageClient
StackTrace:
at Microsoft.WindowsAzure.StorageClient.Tasks.Task`1.get_Result()
at Microsoft.WindowsAzure.StorageClient.Tasks.Task`1.ExecuteAndWait()
at Microsoft.WindowsAzure.StorageClient.CloudBlob.UploadFromStream(Stream source, BlobRequestOptions options)
at Framework.Common.AzureBlobInteraction.UploadToBlob(Stream stream, String BlobContainerName, String fileName, String contentType) in C:\Development\RateSolution2010\Framework.Common\AzureBlobInteraction.cs:line 95
InnerException:
The code is as follows:
public void UploadToBlob(Stream stream, string BlobContainerName, string fileName, string contentType)
{
    // Setup the connection to Windows Azure Storage
    CloudStorageAccount storageAccount = CloudStorageAccount.Parse(GetConnStr());

    DiagnosticMonitorConfiguration dmc = DiagnosticMonitor.GetDefaultInitialConfiguration();
    dmc.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
    dmc.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;
    DiagnosticMonitor.Start(storageAccount, dmc);

    CloudBlobClient BlobClient = null;
    CloudBlobContainer BlobContainer = null;
    BlobClient = storageAccount.CreateCloudBlobClient();

    // For large file copies you need to set up a custom timeout period
    // and using parallel settings appears to spread the copy across multiple threads
    // if you have big bandwidth you can increase the thread number below
    // because Azure accepts blobs broken into blocks in any order of arrival.
    BlobClient.Timeout = new System.TimeSpan(1, 0, 0);
    Role serviceRole = RoleEnvironment.Roles.Where(s => s.Value.Name == "OnlineRates.Web").First().Value;
    BlobClient.ParallelOperationThreadCount = serviceRole.Instances.Count;

    // Get and create the container
    BlobContainer = BlobClient.GetContainerReference(BlobContainerName);
    BlobContainer.CreateIfNotExist();

    // delete prior version if one exists
    BlobRequestOptions options = new BlobRequestOptions();
    options.DeleteSnapshotsOption = DeleteSnapshotsOption.None;
    CloudBlob blobToDelete = BlobContainer.GetBlobReference(fileName);
    Trace.WriteLine("Blob " + fileName + " deleted to be replaced by newer version.");
    blobToDelete.DeleteIfExists(options);

    // set stream to starting position
    stream.Position = 0;
    long totalBytes = 0;

    // Open the stream and read it back.
    using (stream)
    {
        // Create the Blob and upload the file
        CloudBlockBlob blob = BlobContainer.GetBlockBlobReference(fileName);
        try
        {
            BlobClient.ResponseReceived += new EventHandler<ResponseReceivedEventArgs>((obj, responseReceivedEventArgs) =>
            {
                if (responseReceivedEventArgs.RequestUri.ToString().Contains("comp=block&blockid"))
                {
                    totalBytes += Int64.Parse(responseReceivedEventArgs.RequestHeaders["Content-Length"]);
                }
            });
            blob.UploadFromStream(stream);

            // Set the metadata into the blob
            blob.Metadata["FileName"] = fileName;
            blob.SetMetadata();

            // Set the properties
            blob.Properties.ContentType = contentType;
            blob.SetProperties();
        }
        catch (Exception exc)
        {
            Logging.ExceptionLogger.LogEx(exc);
        }
    }
}
I've tried a number of different alterations to the code: deleting a blob before replacing it (although the problem exists on new blobs as well), setting container permissions, not setting permissions, etc.
Your code looks like it should work, but it has lots of extra functionality that is not strictly required. I would cut it down to an absolute minimum and go from there. It's really only a gut feeling, but I think it might be the using statement giving you grief. This entire function could be written (presuming the container already exists) as:
public void UploadToBlob(Stream stream, string BlobContainerName, string fileName, string contentType)
{
    // Setup the connection to Windows Azure Storage
    CloudStorageAccount storageAccount = CloudStorageAccount.Parse(GetConnStr());
    CloudBlobClient BlobClient = storageAccount.CreateCloudBlobClient();
    CloudBlobContainer BlobContainer = BlobClient.GetContainerReference(BlobContainerName);
    CloudBlockBlob blob = BlobContainer.GetBlockBlobReference(fileName);
    stream.Position = 0;
    blob.UploadFromStream(stream);
}
Notes on the stuff that I've removed:
You should set up diagnostics just once when your app starts, not every time a method is called, usually in RoleEntryPoint.OnStart() (see the sketch below).
I'm not sure why you're trying to set ParallelOperationThreadCount higher if you have more instances. Those two things seem unrelated.
It's not good form to check for the existence of a container/table every time you save something to it. It's more usual to do that check once when your app starts, or to have a process external to the website make sure all the required containers/tables/queues exist. Of course, if you're trying to dynamically create containers, this is not true.
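For example, the one-time setup could live in the web role's entry point; a sketch (role, setting, and container names are assumed, not from the original answer):

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Configure diagnostics once, when the role starts.
        DiagnosticMonitorConfiguration dmc = DiagnosticMonitor.GetDefaultInitialConfiguration();
        dmc.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
        dmc.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;

        CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
            RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString")); // assumed setting name
        DiagnosticMonitor.Start(storageAccount, dmc);

        // Make sure the containers the site needs exist once, instead of on every upload.
        storageAccount.CreateCloudBlobClient()
                      .GetContainerReference("uploads")       // assumed container name
                      .CreateIfNotExist();

        return base.OnStart();
    }
}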
The problem turned out to be the firewall settings on my laptop. It's my personal laptop, originally set up at home, so the firewall rules weren't set up for a corporate environment, resulting in slow performance on uploads and downloads.
