Where would it be appropriate to store structured photos in Azure Storage? There are a ton (millions) of photos, and they are currently sitting in folders locally.
I originally looked at blob storage to hold them, but that is for unstructured data; then I looked at table storage, but I'm not sure if the file size is too large for the entity. Also looked at file storage, but it seems like that's only in preview upon request.
Blob Storage is the way to go. It is meant for that purpose only - storing files in the cloud. Some additional reasons:
Today, each storage account can hold 500 TB of data so if you're storing images only in your storage account, you can store up to 500 TB of data.
3 copies of each item (file, in your case) are maintained in the region. If you enable Geo-Replication (GRS) on your storage account, 3 additional copies are maintained in a secondary region which is at least 400 miles away from the primary region. So it would be a good strategy for disaster recovery purposes.
As it is a cloud storage solution, you only pay for the storage space you occupy. So, for example, if you are storing only 15 GB of data, you will only pay for 15 GB.
Table Storage is mainly intended for storing structured/semi-structured data in key/value pair format. Further, each item (known as an Entity in table storage lingo) can be at most 1 MB in size, whereas each item in blob storage can be up to 200 GB in size.
A few other things to consider:
Blob storage is a 2-level hierarchy: Container and Blob. Think of a container as a folder on your computer and a blob as a file. Unlike local storage, you can't have nested folders in blob storage.
Even though blob storage doesn't support a nested folder hierarchy, you can create the illusion of one using something called a blob prefix. For example, let's say you have an images folder, and inside that folder the image files are grouped by year (2014, 2015, etc.). In this case, you can create a container called images. When it comes to saving a file (say C:\images\2014\image1.png), you prefix the blob name with the folder path, so your image is saved as 2014/image1.png in the images container.
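A minimal sketch of that prefixing, assuming the current Azure.Storage.Blobs .NET SDK (the connection string is a placeholder):

using Azure.Storage.Blobs;

// The container is the only real grouping level; the slash in the blob
// name below creates the virtual "2014" folder.
var container = new BlobContainerClient("<connection-string>", "images");
await container.CreateIfNotExistsAsync();

BlobClient blob = container.GetBlobClient("2014/image1.png");
await blob.UploadAsync(@"C:\images\2014\image1.png", overwrite: true);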
You can make use of some available storage explorers for uploading purposes. Most of the storage explorers support preserving the folder hierarchy.
Related
I am trying to figure out what the benefit of Index Tags is vs. creating a full virtual folder tree structure in Azure Blob Storage, when I have full programmatic control over the creation of the blobs.
Virtual Folder structure vs Blob Index Tags
You're asking us to compare two separate features of Azure Blob Storage as though they were mutually exclusive, when in fact they can be used together, and there are more options for organizing blobs than just those two:
TL;DR:
Azure Blob Index Tags - arbitrary mutable tags on your blobs.
Virtual folder structure - this is just a naming convention where your blobs are named with slash-separated "directory" names.
NFS 3.0 Blob Storage and Data Lake Storage Gen2 - this is a major new version (or revision) of Azure Blob Storage that makes it behave almost exactly like a traditional disk file-system (hence the NFS 3.0-compliance) however it (currently) comes with major shortcomings.
In detail:
Azure Blob Index Tags is a recently introduced feature of Azure Blob Storage: it entered preview in May 2020 and left the preview stage in June 2021 (2 months ago at the time of writing).
Your storage account needs to be "General Purpose v2" - so if you have an older-style storage account you'll need to upgrade it.
Advantages:
It's built-in to Azure Blob Storage, so you don't need to maintain your own indexing infrastructure (which is what we used to have to do: I stored my own blob index in a table in Azure Table Storage in the same storage account, and had a process that ran on a disposable Azure VM nightly to index new blobs).
As it's a tagging system it means you can have your own taxonomy and don't have to force your nomenclature into a single hierarchy as with virtual folders.
Tags are mutable: you can add/remove/edit them as you like (see the sketch after this section).
Disadvantages:
As with maintaining your own blob index, the index updates are not instantaneous (unlike an RDBMS, where indexes are always up-to-date). The blog article linked handwaves this away by saying:
"...and the account indexing engine exposes the new blob index shortly after."
...note that they don't define what "shortly" means.
As of August 2021, Azure charges $0.03 per 10,000 tags (regardless of the storage-tier in use). So if you have 1,000,000 blobs and 3 tags per blob, then that's $9/mo.
This isn't a significant cost by any means, but the cost-per-information-theoretic-unit is kinda-high, which is disappointing.
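To make the tagging system concrete, here's a minimal sketch using the Azure.Storage.Blobs v12 .NET SDK; the container, blob, and tag names are arbitrary examples:

using System;
using System.Collections.Generic;
using Azure.Storage.Blobs;

var service = new BlobServiceClient("<connection-string>");
var blob = service.GetBlobContainerClient("photos").GetBlobClient("image1.png");

// Tags are mutable: this call replaces the blob's entire tag set
await blob.SetTagsAsync(new Dictionary<string, string>
{
    ["project"] = "alpha",
    ["year"] = "2014"
});

// Server-side query across the whole account; remember the index lags "shortly"
await foreach (var hit in service.FindBlobsByTagsAsync("project = 'alpha' AND year = '2014'"))
{
    Console.WriteLine($"{hit.BlobContainerName}/{hit.BlobName}");
}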
"Virtual Folder tree structure" - By this I assume you mean giving your blob's hierarchical naming system and using Azure Blob Storage's blob-name-prefix search filter.
Advantages:
Tried-and-tested. Simple.
Doesn't cost you anything.
No indexing delay.
Disadvantages:
It's still as slow as enumerating blobs lexicographically.
You cannot conceptually move or rename blobs.
(You can, technically, provided the source and destination are in the same container, by doing a copy+delete, and the copy operation should be near-instantaneous as I understand that Blob Storage uses COW for same-container copies, but it's still imperfect: the client API still exposes it as an asynchronous operation with an unbounded time-to-copy rather than giving hard guarantees. See the sketch after this list.)
The fact this has been a limitation of Azure Blob Storage for a decade now utterly confounds me.
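For completeness, a sketch of that copy+delete workaround with the Azure.Storage.Blobs v12 .NET SDK (names are placeholders):

using Azure.Storage.Blobs;

var container = new BlobContainerClient("<connection-string>", "images");
BlobClient source = container.GetBlobClient("2014/image1.png");
BlobClient target = container.GetBlobClient("2014-archive/image1.png");

// "Rename" is really copy + delete; the copy is exposed as an async
// operation with no hard completion guarantee, hence the explicit wait.
var copy = await target.StartCopyFromUriAsync(source.Uri);
await copy.WaitForCompletionAsync();
await source.DeleteAsync();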
NFS 3.0 Blob Storage - Also new in 2020/2021 with Blob Index Tags is NFS 3.0 Blob Storage, which gives you a full "real" hierarchical filesystem for your blobs.
The Hierarchical Namespace feature is powered by Azure Data Lake Storage Gen 2. I don't know the technical details of it, so I can't say anything more.
Advantages:
NFS 3.0-compliant (that's huge!) so Linux clients can even mount it directly.
It's cheaper than normal blob storage (whaaaaat?!):
In West US 2, NFS+LRS+Hot is $0.018/GB while the old-school flat namespace with LRS+Hot is $0.0184/GB.
In other Azure locations and with other redundancy options then NFS can be slightly more expensive, but otherwise they're generally within $0.01 of each other.
Disadvantages:
Apparently you're limited to only block-blobs: not page-blobs or append-blobs.
Notes from the Known Issues page:
NFS can only be used with new accounts: you cannot update an existing account. You also cannot disable it once you enable it.
You cannot (currently) lock blobs/files - though this looks to come in a future version.
You cannot use both Blob Index Tags and NFS in the same storage account - or in fact most features of Blob Storage (ooo-er!).
The documentation for operations exclusive to Hierarchical-namespace blobs lists only Set Blob Expiry - there (still) doesn't seem to be a synchronous/atomic "move blob" or "rename blob" operation; instead, the Protocol Support page implies that an operation to rename an NFS file is translated into raw blob storage operations behind the scenes... so I'm curious how they do that atomically.
When your application makes a request by using the NFS 3.0 protocol, that request is translated into combination of block blob operations. For example, NFS 3.0 read Remote Procedure Call (RPC) requests are translated into Get Blob operation. NFS 3.0 write RPC requests are translated into a combination of Get Block List, Put Block, and Put Block List.
Alternative concept: Content-addressable-storage
Because blobs cannot be atomically/synchronously renamed, a few years ago I simply gave up trying to come up with a perfect blob nomenclature that would stand the test of time, because business requirements always change.
Instead, I noticed that my blobs were invariably immutable: once they've been uploaded to storage they're never updated, or when they are updated they're saved to new, separate blobs - which means that a content-addressable naming strategy suited my projects perfectly.
In short: give your immutable blobs a name which is a string-representation of a hash of their content, and store their hashes in a traditional RDBMS where you have much greater flexibility (and ideally: performance) with how they're indexed and referenced by the rest of your system.
In my case, I set my blobs' names to the Base-16 representation of their SHA-256 hash.
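A minimal sketch of that naming scheme in .NET (the container name and file path are placeholders):

using System;
using System.IO;
using System.Security.Cryptography;
using Azure.Storage.Blobs;

var container = new BlobContainerClient("<connection-string>", "cas");

byte[] content = await File.ReadAllBytesAsync(@"C:\data\photo.jpg");

// Blob name = Base-16 (hex) string of the SHA-256 hash of the content
string blobName = Convert.ToHexString(SHA256.HashData(content));

// Identical content produces an identical name, so the existence check
// gives you de-duping for free.
var blob = container.GetBlobClient(blobName);
if (!await blob.ExistsAsync())
    await blob.UploadAsync(new BinaryData(content));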
Advantages:
You get de-duping for free: blobs with identical content will have identical hashes, so you can avoid uploading/downloading the same huge blob twice.
You get integrity checks for free: if you download a blob and its hash doesn't match its blob-name, then your storage account likely got hacked.
Disadvantages:
You still need to maintain your own index in your RDBMS (if applicable) - but you can still use Blob Index Tags with content-addressable storage if you like.
I would like the stats of specific folders in Azure Blob Storage. For example, I would like to know how many files are present in a folder, what's the size of each file, and what's the total size of a folder. Does blob storage provide similar data through an API endpoint?
Edit: I have a very large number of files on Azure Blob so I am looking for a solution where I do not have to iterate over all the files in order to calculate total size of the virtual folder.
Does blob storage provide similar data through an api endpoint?
Azure Blob Storage as such does not provide an API to get storage statistics at the folder level, but you can make use of the List Blobs REST API operation to get that information.
The List Blobs operation lists the blobs inside a container, but you can use the prefix parameter to get the list of blobs inside a virtual folder, where the prefix is the path of the virtual folder. For example, if you wish to list the blobs inside the folder1 virtual folder, you would specify the prefix as folder1/.
Each item in the list is a blob with a size attribute giving the size of that blob. You simply add up the sizes of the individual blobs to get the total size of the folder. (Note that this does mean enumerating every blob under the prefix; there is no server-side aggregate.)
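A minimal sketch with the Azure.Storage.Blobs v12 .NET SDK (container and folder names are examples):

using System;
using Azure.Storage.Blobs;

var container = new BlobContainerClient("<connection-string>", "mycontainer");

long totalSize = 0;
int fileCount = 0;

// prefix restricts the listing to the virtual folder "folder1/"
await foreach (var blob in container.GetBlobsAsync(prefix: "folder1/"))
{
    fileCount++;
    totalSize += blob.Properties.ContentLength ?? 0;
}

Console.WriteLine($"folder1: {fileCount} files, {totalSize} bytes");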
Currently, I have a large set of text files which contain (historical) raw data from various sensors. New files are received and processed every day. I'd like to move this off of an on-premises solution to the cloud.
Would Azure's Blob storage be an appropriate mechanism for this volume of small(ish) private files? or is there another Azure solution that I should be pursuing?
Relevant Data (no pun intended) & Requirements:
The data set contains millions of mostly small files, for a total of nearly 400 GB. The average file size is around 50 KB, but some files could exceed 40 MB.
I need to maintain the existing data set for posterity's sake.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Let me elaborate more on David's comments.
As David mentioned, there's no limit on the number of objects (files) that you can store in Azure Blob Storage. The limit is on the size of the storage account, which is currently 500 TB. As long as you stay within this limit you will be good. Further, you can have 100 storage accounts in an Azure subscription, so essentially the amount of data you will be able to store is practically limitless.
I do want to mention one more thing though. It seems the files uploaded to blob storage are processed once and then essentially archived. For this, I suggest you take a look at Azure Cool Blob Storage. It is meant for exactly this purpose: storing objects that are not frequently accessed, yet are available almost immediately when you need them. The advantage of using Cool Blob Storage is that writes and storage are cheaper compared to Hot Blob Storage accounts; however, reads are more expensive (which makes sense considering the intended use case).
So a possible solution would be to save the files in your Hot Blob Storage account. Once the files are processed, they are moved to Cool Blob Storage. This Cool Blob Storage account can be in the same or a different Azure subscription.
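One caveat: this answer predates blob-level tiering. On a current General Purpose v2 account you can simply flip an individual blob's tier after processing instead of copying it between accounts; a minimal sketch (names are placeholders):

using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

var container = new BlobContainerClient("<connection-string>", "sensor-data");

// After processing, demote the blob: Cool storage is cheaper to hold,
// more expensive to read, which fits write-once/read-rarely archives.
BlobClient blob = container.GetBlobClient("2016/01/readings.txt");
await blob.SetAccessTierAsync(AccessTier.Cool);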
I'm guessing it CAN be used as a file system, but is it the right (best) tool for the job?
Yes, Azure Blob Storage can be used as a cloud file system.
The data set contains millions of mostly small files, for a total of nearly 400 GB. The average file size is around 50 KB, but some files could exceed 40 MB.
As David and Gaurav Mantri mentioned, Azure Blob Storage could meet this requirement.
I need to maintain the existing data set for posterity's sake.
Data in Azure Blob Storage is durable. You can refer to the Service Level Agreements for Storage.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
You can use Azure Functions to do the file processing work. Since it will run once a day, you could add a TimerTrigger function:
using Microsoft.Azure.WebJobs;

// This function will be executed once a day, at midnight UTC
// (CRON format: {second} {minute} {hour} {day} {month} {day-of-week})
[FunctionName("TimerJob")]
public static void TimerJob([TimerTrigger("0 0 0 * * *")] TimerInfo timerInfo)
{
    // write the processing job here
}
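Since processing is meant to be handled by workers reading files off a queue, a QueueTrigger function may be an even closer fit. A minimal sketch, where the queue name (incoming-files), container name (sensor-data), and message format (the blob name) are all assumptions:

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

// Fires once per queue message; the message body is assumed to hold the blob name
[FunctionName("ProcessFile")]
public static void ProcessFile(
    [QueueTrigger("incoming-files")] string blobName,
    [Blob("sensor-data/{QueueTrigger}", FileAccess.Read)] Stream file,
    ILogger log)
{
    // read and process the file stream here
    log.LogInformation($"Processing {blobName}");
}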
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Blobs can be downloaded or updated at any time.
In addition, if your data processing job is very complicated, you also could store your data in Azure Data Lake Store and do the data processing job using Hadoop analytic frameworks such as MapReduce or Hive. Microsoft Azure HDInsight clusters can be provisioned and configured to directly access data stored in Data Lake Store.
Here are the differences between Azure Data Lake Store and Azure Blob Storage.
Comparing Azure Data Lake Store and Azure Blob Storage
I have found documentation describing the limits for blob storage, including the maximum file size and blob size, but I can't find reference to whether there is a limit to the number of files that can be stored - is there a limit, or perhaps more importantly, a performance penalty when there are several hundred thousand (or million) small files stored in blob storage?
There is no limit to the number of blobs in a storage account, aside from the 500 TB limit per storage account. You won't see a performance difference when dealing with individual blobs, whether you have one blob or a million. Now, if you decide to list blobs in a container, and you have a million blobs in it, you will certainly see a difference compared to listing a container with just a handful of blobs. But with direct access via blob name, nope: no perf difference at all.
I have groups of files using the following structure:
RandomFolderName1 [File1.jpg, File2.jpg, File3.jpg...]
RandomFolderName2 [File1.jpg, File2.jpg, File3.jpg...]
I wonder what would be the best way to store this in Blob Storage.
Should I use GUID.jpg for every file name and manage the folder structure in the DB
Should I use FolderName+FileName.jpg, but again will have to manage the folder structure in DB
Should I use a Container for a folder and inside have File1.jpg, File2.jpg, File3.jpg...
Should I store the whole FolderName as a zip and have all the files inside
Is there any other way to define a folder structure in Blob Storage?
Edit: The files will be accessed on a folder basis
You can use file names in Azure blobs like "randomfoldername1/file1.jpg". It will look like a folder structure, and some GUI clients will even let you navigate it like one. But the reality is that the container is the only real grouping factor, and from there it's just a matter of filtering the files in that container based on partial file names.
So to answer your question, you'll likely be fine putting all the files into a single container. Containers help control access policy, and each blob has its own performance target. Aside from ACL reasons, the only other reason to split files across multiple containers is if you have enough blobs that querying them starts to degrade due to the sheer number (or you're exceeding the storage account's throughput targets).
You can find out more about Azure Storage abstractions and throughput targets at: http://www.windows-azure.net/windows-azure-storage-abstractions-and-their-scalability-targets/