How to upload large files using AppendBlob by chunking? - azure

I am able to upload a file to Azure Blob Storage using an append blob, but I would like to know how to do it in less time so that I can upload a large (32 GB) file. The upload logic for the append blob is as follows:
public static async Task<Response> AppendFile(CloudAppendBlob cloudAppendBlob, byte[] fileData, int index, int count)
{
    await cloudAppendBlob.AppendFromByteArrayAsync(fileData, index, count);
    //var res = cloudBlockBlob;
    Response response = new();
    response.fileUrl = cloudAppendBlob.Uri.AbsoluteUri;
    response.length = cloudAppendBlob.Properties.Length;
    if (cloudAppendBlob.Metadata.TryGetValue("sessionId", out string session))
    {
        response.sessionId = session;
    }
    return response;
}
This takes a long time for a 32 GB file. Is there another way to upload and download large files in less time?
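One way to reason about a chunked append upload is to plan the (offset, length) ranges up front and append them in order. The sketch below is plain, service-agnostic Python: `plan_chunks`, `chunk_count`, and the 4 MiB chunk size are illustrative assumptions, not part of any Azure SDK. An append blob accepts at most 50,000 appended blocks, so the chunk size bounds the largest file that fits.

```python
from typing import Iterator, Tuple

MAX_APPEND_BLOCKS = 50_000    # documented block-count limit for an append blob
CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size (4 MiB); adjust to your SDK/service limit


def plan_chunks(total_size: int, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover total_size bytes in order."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length


def chunk_count(total_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of append operations needed (ceiling division)."""
    return -(-total_size // chunk_size)
```

For a 32 GiB file, `chunk_count(32 * 1024**3)` gives 8192 appends, well under the block limit. Each `(offset, length)` pair would be passed to a call such as the `AppendFile` method above. Appends to a single append blob are applied in order, so this plan does not parallelize by itself; if raw throughput is the goal, a block blob with parallel block uploads may be a better fit.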

Related

Upload large file using azure java sdk more than 50k block

I'm trying to upload a 230 GB file to an Azure block blob with the following code:
private void uploadFile(FileObject srcFile, FileObject destFile) throws Exception {
    try {
        BlobClient destBlobClient = blobContainerClient.getBlobClient("destFilename");
        long blockSize = 4 * 1024 * 1024; // 4 MB
        ParallelTransferOptions opts = new ParallelTransferOptions()
            .setBlockSizeLong(blockSize)
            .setMaxConcurrency(5);
        BlobRequestConditions requestConditions = new BlobRequestConditions();
        try (BlobOutputStream bos = destBlobClient.getBlockBlobClient().getBlobOutputStream(
                 opts, null, null, null, requestConditions);
             InputStream is = srcFile.getContent().getInputStream()) {
            byte[] buffer = new byte[(int) blockSize];
            for (int len; (len = is.read(buffer)) != -1; ) {
                bos.write(buffer, 0, len);
            }
        }
    }
    finally {
        destFile.close();
        srcFile.close();
    }
}
Since I explicitly set the block size to 4 MB for each write operation, I assumed that each write would be stored as a single block in Azure. That does not seem to be the case.
For the 230 GB file above, the write operation executed 58,880 times and the file uploaded successfully.
Can someone please explain how the data is split into blocks internally in Azure, to help me understand this better?
Thanks in advance
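The arithmetic behind these numbers can be checked directly. A block blob can hold at most 50,000 committed blocks, so 58,880 separately committed 4 MB blocks would exceed that limit. The successful upload therefore suggests that one `write` call does not map to one committed block; presumably the stream buffers data and stages blocks on its own schedule, which is an inference rather than something the post confirms. The helper names below are illustrative.

```python
MAX_BLOCKS = 50_000  # documented limit on committed blocks per block blob


def blocks_needed(total_size: int, block_size: int) -> int:
    """Ceiling division: number of blocks of block_size needed for total_size bytes."""
    return -(-total_size // block_size)


def min_block_size(total_size: int, max_blocks: int = MAX_BLOCKS) -> int:
    """Smallest block size (in bytes) that keeps the blob within max_blocks blocks."""
    return -(-total_size // max_blocks)


FILE_SIZE = 230 * 1024**3  # 230 GiB, as in the question
BLOCK_SIZE = 4 * 1024**2   # 4 MiB, as in the question
```

Here `blocks_needed(FILE_SIZE, BLOCK_SIZE)` evaluates to 58880, matching the number of writes reported, and `min_block_size(FILE_SIZE)` comes out slightly under 5 MB. Setting the block size to 5 MiB or more would keep a 230 GiB upload within the block limit regardless of how the stream groups the writes.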

Azure Block Blob: "The specified block list is invalid." Microsoft.Azure.Storage.StorageException when compressing files > 2GB between Blobs

The issue happens when I upload a file to one blob (Blob1) which in turn runs a background compression service. The background service streams the file from Blob1, compresses it, and stores it as a zip file in a separate blob (Blob2) to cache for the user to download.
The process works without issue for files < 2GB, but throws a Microsoft.Azure.Storage.StorageException when the file size is > 2GB.
I am using Microsoft.Azure.Storage.Blob 11.2.2.
Sample Code
public async void DoWork(CancellationToken cancellationToken)
{
while (!cancellationToken.IsCancellationRequested)
{
await _messageSemaphore.WaitAsync(cancellationToken);
MyModel model = await _queue.PeekMessage();
if(model != null)
{
try
{
//Get CloudBlockBlob zip blob reference
var zipBlockBlob = await _storageAccount.GetFileBlobReference(_configuration[ConfigKeys.ContainerName], model.Filename, model.FileRelativePath);
using (var zipStream = zipBlockBlob.OpenWrite()) //Opens zipstream
{
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create, false)) // Create new ZipArchive
{
//Add each file to the zip archive
foreach (var fileUri in Files)
{
var file = new Uri(fileUri);
var cloudBlockBlob = blobContainer.GetBlockBlobReference(file);
using (var blobStream = cloudBlockBlob.OpenRead())//Opens read stream
{
//Create new ZipEntry
var zipEntry = archive.CreateEntry(model.Filename, CompressionLevel.Fastest);
using (var zipEntryStream = zipEntry.Open())
{
//Zip file
blobStream.CopyTo(zipEntryStream);
}
}
}
}
}
}
catch (FileNotFoundException e)
{
Console.WriteLine($"Download Error: {e.Message}");
}
catch (Microsoft.Azure.Storage.StorageException e)
{
Console.WriteLine($"Storage Exception Error: {e.Message}");
}
}
else
{
_messageSemaphore.Release();
// Wait for 1 minute between polls when idle
await Task.Delay(_sleepTime);
}
}
}
The problem was concurrency: multiple application instances were writing to the same blob at the same time. To get around this, I took out an Azure blob lease on the blob being created, which required creating a temporary file to apply the lease to. This is a decent enough workaround for now, but it should probably be implemented with an event-driven Azure service.

Append to CloudBlockBlob stream

We have a file system abstraction that allows us to easily switch between local and cloud (Azure) storage.
For reading and writing files we have the following members:
Stream OpenRead();
Stream OpenWrite();
Part of our application "bundles" documents into one file. For our local storage provider OpenWrite returns an appendable stream:
public Stream OpenWrite()
{
return new FileStream(fileInfo.FullName, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite, BufferSize, useAsync: true);
}
For Azure blob storage we do the following:
public Stream OpenWrite()
{
return blob.OpenWrite();
}
Unfortunately this overwrites the blob contents each time. Is it possible to return a writable stream that can be appended to?
According to the documentation for OpenWrite (http://msdn.microsoft.com/en-us/library/microsoft.windowsazure.storage.blob.cloudblockblob.openwrite.aspx), the OpenWrite method overwrites an existing blob unless explicitly prevented using the accessCondition parameter.
One thing you could do is read the blob data in a stream and return that stream to your calling application and let that application append data to that stream. For example, see the code below:
static void BlobStreamTest()
{
CloudStorageAccount storageAccount = CloudStorageAccount.DevelopmentStorageAccount;
CloudBlobContainer container = storageAccount.CreateCloudBlobClient().GetContainerReference("temp");
container.CreateIfNotExists();
CloudBlockBlob blob = container.GetBlockBlobReference("test.txt");
blob.UploadFromStream(new MemoryStream());//Let's just create an empty blob for the sake of demonstration.
for (int i = 0; i < 10; i++)
{
try
{
using (MemoryStream ms = new MemoryStream())
{
blob.DownloadToStream(ms);//Read blob data in a stream.
byte[] dataToWrite = Encoding.UTF8.GetBytes("This is line # " + (i + 1) + "\r\n");
ms.Write(dataToWrite, 0, dataToWrite.Length);
ms.Position = 0;
blob.UploadFromStream(ms);
}
}
catch (StorageException excep)
{
if (excep.RequestInformation.HttpStatusCode != 404)
{
throw;
}
}
}
}
There is now a CloudAppendBlob class that allows you to add content to an existing blob:
var account = CloudStorageAccount.Parse("storage account connectionstring");
var client = account.CreateCloudBlobClient();
var container = client.GetContainerReference("container name");
var blob = container.GetAppendBlobReference("blob name");
In your case you want to append from a stream:
await blob.AppendFromStreamAsync(new MemoryStream());
You can also append from text, a byte array, or a file; check the documentation.

Regarding CloudBlockblob.putBlock and CloudBlockBlob.PutBlockList

I am aware that we can use CloudBlockBlob.PutBlock and CloudBlockBlob.PutBlockList to upload in chunks, but these methods do not have a lease ID parameter.
Can I instead build an HttpWebRequest with the "x-ms-lease-id" header and use it with CloudBlockBlob.PutBlock and CloudBlockBlob.PutBlockList?
Hi Gaurav, my reply was too long for a comment, so I am adding it here.
I tried BlobRequest.PutBlock and BlobRequest.PutBlockList with the following code:
for (int idxThread = 0; idxThread < numThreads; idxThread++)
{
    tasks.Add(Task.Factory.StartNew(() =>
    {
        KeyValuePair<int, int> blockIdAndLength;
        while (true)
        {
            lock (queue)
            {
                if (queue.Count == 0)
                    break;
                blockIdAndLength = queue.Dequeue();
            }
            byte[] buff = new byte[blockIdAndLength.Value];
            // Copy this chunk from the input byte array into buff.
            Array.Copy(buffer, blockIdAndLength.Key * (long)blockIdAndLength.Value, buff, 0, blockIdAndLength.Value);
            // Upload block.
            string blockName = Convert.ToBase64String(BitConverter.GetBytes(
                blockIdAndLength.Key));
            //string blockIdString = Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes(string.Format("BlockId{0}", blockIdAndLength.Key.ToString("0000000"))));
            // For small files (around 100 KB) this works fine; for large files (around 10 MB) it ends up uploading only 2-3 MB.
            // Is there a better way to implement uploading in chunks with a lease?
            string url = blob.Uri.ToString();
            if (blob.ServiceClient.Credentials.NeedsTransformUri)
            {
                url = blob.ServiceClient.Credentials.TransformUri(url);
            }
            var req = BlobRequest.Put(new Uri(url), 90, new BlobProperties(), BlobType.BlockBlob, leaseId, 0);
            using (Stream writer = req.GetRequestStream())
            {
                writer.Write(buff, 0, buff.Length);
            }
            blob.ServiceClient.Credentials.SignRequest(req);
            req.GetResponse().Close();
        }
    }));
}
// Wait for all threads to complete uploading data.
Task.WaitAll(tasks.ToArray());
This does not work for multiple chunks. Could you please share your thoughts?
I don't think you can. However, take a look at the BlobRequest class in the Microsoft.WindowsAzure.StorageClient.Protocol namespace. It has PutBlock and PutBlockList functions, which allow you to specify a LeaseId.
Hope this helps.
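As a language-neutral illustration of the block protocol discussed here, the Python sketch below splits a buffer into blocks, gives each one a fixed-length Base64 block ID, stages the blocks out of order, and then commits them with an ordered ID list. `FakeBlockBlob` is a hypothetical in-memory stand-in, not a real SDK class; against the real service the two steps correspond to the Put Block and Put Block List operations. Note that in the posted code `blockName` is computed but never passed to the request, and no block list is committed.

```python
import base64
from typing import Dict, List, Tuple


def block_id(index: int) -> str:
    """Fixed-length Base64 ID; all block IDs within one blob must have equal length."""
    return base64.b64encode(f"{index:08d}".encode("ascii")).decode("ascii")


class FakeBlockBlob:
    """Hypothetical in-memory stand-in for a block blob (not an Azure SDK type)."""

    def __init__(self) -> None:
        self.staged: Dict[str, bytes] = {}
        self.content = b""

    def stage_block(self, bid: str, data: bytes) -> None:
        # Staged blocks are not part of the blob until a block list is committed.
        self.staged[bid] = data

    def commit_block_list(self, ids: List[str]) -> None:
        # The committed list defines both membership and order of the blob content.
        self.content = b"".join(self.staged[i] for i in ids)


def upload_in_blocks(blob: FakeBlockBlob, data: bytes, block_size: int) -> List[str]:
    """Stage every chunk under its own ID, then commit the IDs in file order."""
    chunks: List[Tuple[str, bytes]] = [
        (block_id(n), data[start:start + block_size])
        for n, start in enumerate(range(0, len(data), block_size))
    ]
    for bid, chunk in reversed(chunks):  # staging order does not matter
        blob.stage_block(bid, chunk)
    ids = [bid for bid, _ in chunks]
    blob.commit_block_list(ids)
    return ids
```

Because the final content is assembled from the committed list, the staging step can run on several threads, as long as every chunk gets a distinct ID and the commit happens once, after all blocks have been staged.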

Getting blob count in an Azure Storage container

What is the most efficient way to get the number of blobs in an Azure Storage container?
Right now I can't think of any way other than the code below:
CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs().Count();
If you just want to know how many blobs are in a container without writing code you can use the Microsoft Azure Storage Explorer application.
Open the desired BlobContainer
Click the Folder Statistics icon
Observe the count of blobs in the Activities window
I tried counting blobs using ListBlobs() and for a container with about 400,000 items, it took me well over 5 minutes.
If you have complete control over the container (that is, you control when writes occur), you could cache the size information in the container metadata and update it every time an item gets removed or inserted. Here is a piece of code that would return the container blob count:
static int CountBlobs(string storageAccount, string containerId)
{
CloudStorageAccount cloudStorageAccount = CloudStorageAccount.Parse(storageAccount);
CloudBlobClient blobClient = cloudStorageAccount.CreateCloudBlobClient();
CloudBlobContainer cloudBlobContainer = blobClient.GetContainerReference(containerId);
cloudBlobContainer.FetchAttributes();
string count = cloudBlobContainer.Metadata["ItemCount"];
string countUpdateTime = cloudBlobContainer.Metadata["CountUpdateTime"];
bool recountNeeded = false;
if (String.IsNullOrEmpty(count) || String.IsNullOrEmpty(countUpdateTime))
{
recountNeeded = true;
}
else
{
DateTime dateTime = new DateTime(long.Parse(countUpdateTime));
// Are we close to the last modified time?
if (Math.Abs(dateTime.Subtract(cloudBlobContainer.Properties.LastModifiedUtc).TotalSeconds) > 5) {
recountNeeded = true;
}
}
int blobCount;
if (recountNeeded)
{
blobCount = 0;
BlobRequestOptions options = new BlobRequestOptions();
options.BlobListingDetails = BlobListingDetails.Metadata;
foreach (IListBlobItem item in cloudBlobContainer.ListBlobs(options))
{
blobCount++;
}
cloudBlobContainer.Metadata.Set("ItemCount", blobCount.ToString());
cloudBlobContainer.Metadata.Set("CountUpdateTime", DateTime.UtcNow.Ticks.ToString()); // UTC, so it is comparable with LastModifiedUtc
cloudBlobContainer.SetMetadata();
}
else
{
blobCount = int.Parse(count);
}
return blobCount;
}
This, of course, assumes that you update ItemCount/CountUpdateTime every time the container is modified. CountUpdateTime is a heuristic safeguard (if the container did get modified without someone updating CountUpdateTime, this will force a re-count) but it's not reliable.
The API doesn't contain a container count method or property, so you'd need to do something like what you posted. However, you'll need to deal with NextMarker if more than 5,000 items are returned (or if you specify a maximum number to return and the list exceeds that number). You'll then make additional calls based on NextMarker and add up the counts.
EDIT: Per smarx: the SDK should take care of NextMarker for you. You'll need to deal with NextMarker if you're working at the API level, calling List Blobs through REST.
Alternatively, if you're controlling the blob insertions/deletions (through a WCF service, for example), you can use the blob container's metadata area to store a cached container count that you update with each insert or delete. You'll just need to deal with write concurrency to the container.
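One common way to handle that write concurrency is optimistic concurrency: read the counter together with its version tag, and write back only if the tag is unchanged, retrying otherwise. The sketch below simulates this in memory. `FakeContainerMetadata`, its `etag` field, and `PreconditionFailed` are hypothetical stand-ins for a container's metadata, its ETag, and a failed conditional write; they are not real SDK types.

```python
from typing import Dict, Tuple


class PreconditionFailed(Exception):
    """Raised when a conditional write finds that the stored version has changed."""


class FakeContainerMetadata:
    """Hypothetical in-memory stand-in for container metadata with an ETag."""

    def __init__(self) -> None:
        self.values: Dict[str, str] = {"ItemCount": "0"}
        self.etag = 0

    def read(self) -> Tuple[Dict[str, str], int]:
        return dict(self.values), self.etag

    def write(self, values: Dict[str, str], if_match: int) -> None:
        if if_match != self.etag:
            raise PreconditionFailed()
        self.values = dict(values)
        self.etag += 1  # every successful write produces a new version


def adjust_count(store: FakeContainerMetadata, delta: int, max_retries: int = 5) -> int:
    """Add delta to ItemCount, retrying if another writer got in first."""
    for _ in range(max_retries):
        values, etag = store.read()
        new_count = int(values.get("ItemCount", "0")) + delta
        values["ItemCount"] = str(new_count)
        try:
            store.write(values, if_match=etag)
            return new_count
        except PreconditionFailed:
            continue  # someone else updated the metadata; re-read and try again
    raise RuntimeError("could not update ItemCount after retries")
```

A caller would run `adjust_count(store, +1)` after each insert and `adjust_count(store, -1)` after each delete. The retry loop trades a little latency under contention for never losing an update.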
Example using PHP API and getNextMarker.
Counts total number of blobs in an Azure container.
It takes a long time: about 30 seconds for 100000 blobs.
(assumes we have a valid $connectionString and a $container_name)
$blobRestProxy = ServicesBuilder::getInstance()->createBlobService($connectionString);
$opts = new ListBlobsOptions();
$nblobs = 0;
$cont = true; // keep requesting pages until no next marker is returned
while ($cont) {
    $blob_list = $blobRestProxy->listBlobs($container_name, $opts);
    $nblobs += count($blob_list->getBlobs());
    $nextMarker = $blob_list->getNextMarker();
    if (!$nextMarker || strlen($nextMarker) == 0) $cont = false;
    else $opts->setMarker($nextMarker);
}
echo $nblobs;
If you are not using virtual directories, the following will work as previously answered.
CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs().Count();
However, the above code snippet may not have the desired count if you are using virtual directories.
For instance, if your blobs are stored similar to the following: /container/directory/filename.txt where the blob name = directory/filename.txt the container.ListBlobs().Count(); will only count how many "/directory" virtual directories you have. If you want to list blobs contained within virtual directories, you need to set the useFlatBlobListing = true in the ListBlobs() call.
CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs(null, true).Count();
Note: the ListBlobs() call with useFlatBlobListing = true is a much more expensive/slow call...
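To see why the two calls give different counts, the sketch below mimics the listing semantics in plain Python: a hierarchical listing with the `/` delimiter collapses everything under a common prefix into one virtual-directory entry, while a flat listing returns every blob. `list_flat` and `list_hierarchical` are illustrative helpers, not SDK functions.

```python
from typing import Iterable, List


def list_flat(names: Iterable[str]) -> List[str]:
    """Flat listing: every blob name is returned, regardless of any '/' in it."""
    return sorted(names)


def list_hierarchical(names: Iterable[str], delimiter: str = "/") -> List[str]:
    """Top-level listing: blobs under a prefix collapse into one 'prefix/' entry."""
    entries = set()
    for name in names:
        head, sep, _ = name.partition(delimiter)
        entries.add(head + sep if sep else name)
    return sorted(entries)


NAMES = ["directory/a.txt", "directory/b.txt", "other/c.txt", "root.txt"]
```

With these sample names, `len(list_hierarchical(NAMES))` is 3 (two virtual directories plus one top-level blob), while `len(list_flat(NAMES))` is 4, the real blob count.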
Bearing in mind all the performance concerns from the other answers, here is a version for v12 of the Azure SDK leveraging IAsyncEnumerable. This requires a package reference to System.Linq.Async.
public async Task<int> GetBlobCount()
{
var container = await GetBlobContainerClient();
var blobsPaged = container.GetBlobsAsync();
return await blobsPaged
.AsAsyncEnumerable()
.CountAsync();
}
With the Python API for Azure Storage, it looks like this:
from azure.storage import *
blob_service = BlobService(account_name='myaccount', account_key='mykey')
blobs = blob_service.list_blobs('mycontainer')
len(blobs) #returns the number of blob in a container
If you are using Azure.Storage.Blobs library, you can use something like below:
public int GetBlobCount(string containerName)
{
int count = 0;
BlobContainerClient container = new BlobContainerClient(blobConnctionString, containerName);
container.GetBlobs().ToList().ForEach(blob => count++);
return count;
}
Another Python example; it is slow but works correctly with more than 5,000 files:
from azure.storage.blob import BlobServiceClient
constr = "Connection string"
container = "Container name"
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container)
blobs_list = container_client.list_blobs()
num = 0
size = 0
for blob in blobs_list:
    num += 1
    size += blob.size
    print(blob.name, blob.size)
print("Count: ", num)
print("Size: ", size)
I spent quite a while finding the solution below, and I don't want someone else to waste the same time, so I am replying here even after 9 years.
package com.sai.koushik.gandikota.test.app;
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.*;
public class AzureBlobStorageUtils {
public static void main(String[] args) throws Exception {
AzureBlobStorageUtils getCount = new AzureBlobStorageUtils();
String storageConn = "<StorageAccountConnection>";
String blobContainerName = "<containerName>";
String subContainer = "<subContainerName>";
Integer fileContainerCount = getCount.getFileCountInSpecificBlobContainersSubContainer(storageConn,blobContainerName, subContainer);
System.out.println(fileContainerCount);
}
public Integer getFileCountInSpecificBlobContainersSubContainer(String storageConn, String blobContainerName, String subContainer) throws Exception {
try {
CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConn);
CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
CloudBlobContainer blobContainer = blobClient.getContainerReference(blobContainerName);
return ((CloudBlobDirectory) blobContainer.listBlobsSegmented().getResults().stream().filter(listBlobItem -> listBlobItem.getUri().toString().contains(subContainer)).findFirst().get()).listBlobsSegmented().getResults().size();
} catch (Exception e) {
throw new Exception(e.getMessage());
}
}
}
Count all blobs in a classic and new blob storage account. Building on #gandikota-saikoushik, this solution works for blob containers with a very large number of blobs.
// Setup: set these values from the Azure Portal
var accountName = "<ACCOUNTNAME>";
var accountKey = "<ACCOUNTKEY>";
var containerName = "<CONTAINERNAME>";
var uristr = $"DefaultEndpointsProtocol=https;AccountName={accountName};AccountKey={accountKey}";
var storageAccount = Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse(uristr);
var client = storageAccount.CreateCloudBlobClient();
var container = client.GetContainerReference(containerName);
BlobContinuationToken continuationToken = new BlobContinuationToken();
var blobcount = CountBlobs(container, continuationToken).ConfigureAwait(false).GetAwaiter().GetResult();
Console.WriteLine($"blobcount:{blobcount}");
public static async Task<int> CountBlobs(CloudBlobContainer container, BlobContinuationToken currentToken)
{
    // Note: currentToken is not used; listing always starts from the beginning.
    BlobContinuationToken continuationToken = null;
    var result = 0;
    do
    {
        var response = await container.ListBlobsSegmentedAsync(continuationToken);
        continuationToken = response.ContinuationToken;
        result += response.Results.Count();
    }
    while (continuationToken != null);
    return result;
}
The list-blobs approach is accurate but slow if you have millions of blobs. Another way, which works only in some cases but is relatively fast, is to query the MetricsHourPrimaryTransactionsBlob table. It is at the account level, and metrics are aggregated hourly.
https://learn.microsoft.com/en-us/azure/storage/common/storage-analytics-metrics
You can use this:
public static async Task<List<IListBlobItem>> ListBlobsAsync()
{
    BlobContinuationToken continuationToken = null;
    List<IListBlobItem> results = new List<IListBlobItem>();
    CloudBlobContainer container = GetContainer("containerName");
    do
    {
        // Flat listing, up to 5000 results per segment.
        var response = await container.ListBlobsSegmentedAsync(null,
            true, BlobListingDetails.None, 5000, continuationToken, null, null);
        continuationToken = response.ContinuationToken;
        results.AddRange(response.Results);
    } while (continuationToken != null);
    return results;
}
and then call
var count = (await ListBlobsAsync()).Count;
Hope it will be useful.

Resources