Limits on File Count for Azure Blob Storage

Currently, I have a large set of text files which contain (historical) raw data from various sensors. New files are received and processed every day. I'd like to move this off of an on-premises solution to the cloud.
Would Azure's Blob storage be an appropriate mechanism for this volume of small(ish) private files? or is there another Azure solution that I should be pursuing?
Relevant data (no pun intended) & requirements:
The data set contains millions of mostly small files, totaling nearly 400 GB. The average file size is around 50 KB, but some files could exceed 40 MB.
I need to maintain the existing data set for posterity's sake.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
Certain files would be downloaded / reviewed / reprocessed after the initial processing.

Let me elaborate more on David's comments.
As David mentioned, there's no limit on the number of objects (files) you can store in Azure Blob Storage. The limit is on the size of the storage account, which is currently 500 TB. As long as you stay within this limit you will be fine. Further, you can have 100 storage accounts per Azure subscription, so the amount of data you can store is practically limitless.
I do want to mention one more thing though. It seems that the files uploaded to blob storage are processed once and then essentially archived. For this I suggest you take a look at Azure Cool Blob Storage. It is meant for exactly this purpose: storing objects that are infrequently accessed, yet available almost immediately when you do need them. The advantage of Cool Blob Storage is that writes and storage are cheaper compared to Hot Blob Storage accounts; however, reads are more expensive (which makes sense considering the intended use case).
So a possible solution would be to save the files in your Hot Blob Storage account. Once the files are processed, move them to Cool Blob Storage. This Cool Blob Storage account can be in the same or a different Azure subscription.
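At the time this answer was written, Cool was an account-level setting; with blob-level tiering available today, the move can be a single Set Blob Tier call. A minimal sketch with the Azure.Storage.Blobs SDK (the connection string, container, and blob names are placeholders):

using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Minimal sketch: demote a processed file to the Cool tier so storage
// costs drop while the blob stays immediately readable.
var container = new BlobContainerClient("<connection-string>", "sensor-data");
container.GetBlobClient("2016/11/reading-001.txt").SetAccessTier(AccessTier.Cool);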

I'm guessing it CAN be used as a file system, but is it the right (best) tool for the job?
Yes, Azure Blob Storage can be used as a cloud file system.
The data set contains millions of mostly small files, totaling nearly 400 GB. The average file size is around 50 KB, but some files could exceed 40 MB.
As David and Gaurav Mantri mentioned, Azure Blob Storage could meet this requirement.
I need to maintain the existing data set for posterity's sake.
Data in Azure Blob Storage is durable. You can refer to the Service Level Agreement (SLA) for Storage.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
You can use Azure Functions to do the file-processing work. Since it runs once a day, you could add a TimerTrigger function.
// This function will be executed once a day
public static void TimerJob([TimerTrigger("0 0 0 * * *")] TimerInfo timerInfo)
{
    // Write the processing job here
}
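Since the question mentions background workers reading files off a queue, a QueueTrigger function may be an even closer fit than a timer. A minimal sketch, assuming each queue message carries the name of a newly uploaded blob; the queue name "sensor-files" and container name "sensor-data" are placeholders:

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ProcessSensorFile
{
    // Fires once per queue message; the blob binding resolves the message
    // text ("{queueTrigger}") to a blob in the "sensor-data" container.
    public static void Run(
        [QueueTrigger("sensor-files")] string blobName,
        [Blob("sensor-data/{queueTrigger}", FileAccess.Read)] Stream blobStream,
        ILogger log)
    {
        // Read and process the blob contents here.
        log.LogInformation($"Processing {blobName}");
    }
}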
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Blobs can be downloaded or updated at any time you want.
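For illustration, re-downloading a stored file for review takes only a couple of lines with the Azure.Storage.Blobs SDK (the names below are placeholders):

using Azure.Storage.Blobs;

// Minimal sketch: pull a previously processed file back down for review.
var container = new BlobContainerClient("<connection-string>", "sensor-data");
container.GetBlobClient("2016/11/reading-001.txt").DownloadTo("reading-001.txt");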
In addition, if your data-processing job is very complicated, you could also store your data in Azure Data Lake Store and do the processing using Hadoop analytic frameworks such as MapReduce or Hive. Microsoft Azure HDInsight clusters can be provisioned and configured to directly access data stored in Data Lake Store.
Here are the differences between Azure Data Lake Store and Azure Blob Storage.
Comparing Azure Data Lake Store and Azure Blob Storage

Related

Is rehydration of the (Azure Blob Storage) archive tier always needed?

I have studied the following link to understand the Hot, Cool and Archive tiers of Azure Storage V2.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
In the Blob rehydration section it says:
To read data in archive storage, you must first change the tier of the blob to hot or cool. This process is known as rehydration and can take up to 15 hours to complete.
My questions are:
Can I get just a list of all blobs without rehydration? Is it going to cost me?
Do I have to perform rehydration before reading/deleting a single file?
Do I have to perform rehydration to delete a file before 180 days?
All answers are taken from the article you linked to:
1) Yes, you can get a list and it will not cost you extra
2) Yes, you have to rehydrate to read file contents, but you can delete without rehydrating
While a blob is in archive storage, the blob data is offline and cannot be read, copied, overwritten, or modified. You can't take snapshots of a blob in archive storage. However, the blob metadata remains online and available, allowing you to list the blob and its properties. For blobs in archive, the only valid operations are GetBlobProperties, GetBlobMetadata, ListBlobs, SetBlobTier, and DeleteBlob.
As an addition to the answer to the reading part of question 2) (see the short Set Blob Tier sketch after point 3):
Blob-level tiering allows you to change the tier of your data at the object level using a single operation called Set Blob Tier. You can easily change the access tier of a blob among the hot, cool, or archive tiers as usage patterns change, without having to move data between accounts. All tier changes happen immediately. However, rehydrating a blob from archive can take several hours.
3) The 180 days are the minimum amount of time a blob needs to be in archive storage. Changes before that period incur an early deletion charge. This does not change the way you delete blobs, so you can still call DeleteBlob (and be charged the early deletion charge).
Any blob that is deleted or moved out of the cool (GPv2 accounts only) or archive tier before 30 days and 180 days respectively will incur a prorated early deletion charge.
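To make the Set Blob Tier operation from point 2) concrete, here is a minimal sketch of requesting rehydration with the Azure.Storage.Blobs SDK; the connection string, container, and blob names are placeholders:

using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Minimal sketch: ask for an archived blob to be rehydrated by moving it to Hot.
// The blob stays offline until rehydration completes, which can take hours.
var container = new BlobContainerClient("<connection-string>", "archive-container");
container.GetBlobClient("old-log.json").SetAccessTier(AccessTier.Hot);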

Is Azure Blob storage the right place to store many (small) communication logs?

I am working with a program which connects to multiple APIs; the logs for each operation (HTML/XML/JSON) need to be stored for possible later review. Is it feasible to store each request/reply in an Azure blob? There can be hundreds of requests per second (all of which need storing), varying in size with an average of around 100 KB.
Because the logs need to be searchable (by metadata), my plan is to store them in Azure Blob Storage and put the metadata (blob locations, custom application-related request and content identifiers, etc.) in an easily searchable database.
You can store logs in Azure Table storage or Blob storage, but Microsoft itself recommends using Blob storage; Azure Storage Analytics stores its log data in Blob storage.
The 'Azure Storage Table Design Guide' points out several drawbacks of using Table storage for logs and also provides details on how to use Blob storage to store logs. Read the 'Log data anti-pattern' section in particular for this use case.
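A minimal sketch of writing one such log blob with the Azure.Storage.Blobs SDK; the container name, blob naming scheme, and metadata keys are illustrative assumptions, and the blob URI plus the same identifiers would then be inserted into the searchable database:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Minimal sketch: store one request/reply log as a blob with searchable metadata.
var container = new BlobContainerClient("<connection-string>", "api-logs");
var blob = container.GetBlobClient($"2016/11/{Guid.NewGuid()}.json");

var options = new BlobUploadOptions
{
    Metadata = new Dictionary<string, string>
    {
        ["api"] = "payments-api",   // hypothetical identifiers
        ["requestId"] = "12345"
    }
};

using var content = new MemoryStream(Encoding.UTF8.GetBytes("{ \"request\": \"...\", \"reply\": \"...\" }"));
blob.Upload(content, options);
// blob.Uri and the identifiers above then go into the searchable database.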

How to archive Azure blob storage content?

I need to store some temporary files, maybe for 1 to 3 months. Only the last three months of files need to be kept; older files need to be deleted. How can I do this in Azure Blob Storage? Is there any other option in this case other than blob storage?
IMHO the best options for storing files in Azure are Blob Storage and File Storage; however, neither supports automatic expiration of content (based on age or some other criteria).
This feature was requested for Blob Storage a long time ago, but unfortunately no progress has been made so far (https://feedback.azure.com/forums/217298-storage/suggestions/7010724-support-expiration-auto-deletion-of-blobs).
You could however write something of your own to achieve this. It's rather simple: periodically (say, once a day) your program fetches the list of blobs and compares the last-modified date of each blob with the current date. If the last-modified date is older than the desired period (1 or 3 months, as you mentioned), you simply delete the blob.
You can use WebJobs, Azure Functions or Azure Automation to schedule your code to run on a periodic basis. In fact, there's ready-made code available if you want to use the Azure Automation service: https://gallery.technet.microsoft.com/scriptcenter/Remove-Storage-Blobs-that-aae4b761.
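A minimal sketch of that cleanup loop with the Azure.Storage.Blobs SDK; the connection string, container name, and 90-day cutoff are placeholders:

using System;
using Azure.Storage.Blobs;

// Minimal sketch: delete blobs whose last-modified date is older than the retention window.
var container = new BlobContainerClient("<connection-string>", "temp-files");
var cutoff = DateTimeOffset.UtcNow.AddDays(-90);

foreach (var blob in container.GetBlobs())
{
    if (blob.Properties.LastModified < cutoff)
    {
        container.DeleteBlobIfExists(blob.Name);
    }
}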
As far as I know, Azure Blob Storage is an appropriate approach for storing temporary files. For your scenario, there is no built-in option to delete the old files, so you need to delete your temporary files programmatically or manually.
As a simple approach, you could upload your blobs (files) with a date-based naming format (e.g. https://<your-storagename>.blob.core.windows.net/containerName/2016-11/fileName, or with one container per month: https://<your-storagename>.blob.core.windows.net/2016-11/fileName), then manually manage your files via Microsoft Azure Storage Explorer.
Also, you could check your files and delete the old ones before uploading a new temporary file. For more details, you could follow storage-blob-dotnet-store-temp-files and override the method CleanStorageIfReachLimit to implement your logic for deleting blobs (files).
Additionally, you could leverage a scheduled Azure WebJob to clean up your blobs (files), as sketched below.
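A minimal sketch of that cleanup under the date-based naming convention above; the connection string, container name, and three-month retention are placeholders:

using System;
using Azure.Storage.Blobs;

// Minimal sketch: with names like "2016-11/fileName", a whole month can be
// removed by listing the blobs under that prefix and deleting them.
var container = new BlobContainerClient("<connection-string>", "containerName");
string expiredMonth = DateTime.UtcNow.AddMonths(-3).ToString("yyyy-MM");

foreach (var blob in container.GetBlobs(prefix: expiredMonth + "/"))
{
    container.DeleteBlobIfExists(blob.Name);
}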
You can use Azure Cool Blob Storage.
It is cheaper than Hot Blob Storage and is more suitable for archives.
You can store your less frequently accessed data in the Cool access tier at a low storage cost (as low as $0.01 per GB in some regions), and your more frequently accessed data in the Hot access tier at a lower access cost.
Here is a document that explains its features:
https://azure.microsoft.com/en-us/blog/introducing-azure-cool-storage/

Images in Azure storage

Where would it be appropriate to store structured photos in Azure Storage? There are a ton (millions) of photos and they are currently sitting in folders locally.
I originally looked at Blob storage to hold them, but that is for unstructured data; then I looked at Table storage, but I'm not sure if the file size is too large for an entity. I also looked at File storage, but it seems that's only in preview upon request.
Blob Storage is the way to go. It is meant for exactly that purpose: storing files in the cloud. Some additional reasons:
Today, each storage account can hold 500 TB of data so if you're storing images only in your storage account, you can store up to 500 TB of data.
3 copies of each item (file, in your case) are maintained in the region. If you enable geo-replication (GRS) on your storage account, 3 additional copies are maintained in a secondary region which is at least 400 miles away from the primary region, so it is a good strategy for disaster-recovery purposes.
As it is a cloud storage solution, you only pay for the storage space you occupy. So, for example, if you are storing only 15 GB of data, you will only pay for 15 GB.
Table Storage is mainly intended for storing structured/semi-structured data in key/value pair format. Further, each item (known as an entity in Table storage lingo) can be at most 1 MB in size, whereas each item in Blob storage can be up to 200 GB in size.
A few other things to consider:
Blob storage is a two-level storage: container and blob. Think of a container as a folder on your computer and a blob as a file. Unlike local storage, you can't have nested folders in blob storage.
Even though blob storage doesn't support a nested folder hierarchy, you can create the illusion of one through something called a blob prefix. To give you an example, let's say you have an images folder and inside that folder the image files are grouped by year (2014, 2015, etc.). In this case, you can create a container called images. Now when it comes to saving files (say C:\images\2014\image1.png), you can prefix the folder path so your image is saved as 2014/image1.png in the images container.
You can make use of some available storage explorers for uploading purposes. Most of the storage explorers support preserving the folder hierarchy.
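A minimal sketch of that virtual-folder idea with the Azure.Storage.Blobs SDK, using the example names from above (the connection string is a placeholder):

using System;
using Azure.Storage.Blobs;

// Minimal sketch: store C:\images\2014\image1.png as the blob "2014/image1.png"
// inside a container called "images".
var container = new BlobContainerClient("<connection-string>", "images");
container.GetBlobClient("2014/image1.png").Upload(@"C:\images\2014\image1.png", overwrite: true);

// Listing with a prefix later returns only the 2014 "folder".
foreach (var blob in container.GetBlobs(prefix: "2014/"))
{
    Console.WriteLine(blob.Name);
}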

best design solution to migrate data from SQL Azure to Azure Table

In our service, we are using SQL Azure as the main storage and Azure Table as the backup storage. Every day, about 30 GB of data is collected and stored in SQL Azure. Since the data is no longer valid from the next day, we want to migrate the data from SQL Azure to Azure Table every night.
The question is: what would be the most efficient way to migrate data from SQL Azure to Azure Table?
The naive idea I came up with is to leverage the producer/consumer concept using an IDataReader. That is, first get a data reader by executing "select * from TABLE" and put the data into a queue. At the same time, a set of threads grab data from the queue and insert it into Azure Table.
Of course, the main disadvantage of this approach (I think) is that we need to keep the connection open for a long time (possibly several hours).
Another approach is to first copy the data from the SQL Azure table to local storage on Windows Azure and use the same producer/consumer concept. In this approach we can close the connection as soon as the copy is done.
At this point, I'm not sure which one is better, or whether either of them is a good design to implement. Could you suggest a good design solution for this problem?
Thanks!
I would not recommend using local storage primarily because
It is transient storage.
You're limited by the size of local storage (which in turn depends on the size of the VM).
Local storage is local only, i.e. it is accessible only to the VM in which it is created, which prevents you from scaling out your solution.
I like the idea of using queues; however, I see some issues there as well:
Assuming you're planning on storing each row in a queue as a message, you would be performing a lot of storage transactions. If we assume that your row size is 64 KB, to store 30 GB of data you would be doing about 500,000 write transactions (and similarly 500,000 read transactions). I hope I got my math right :). Even though storage transactions are cheap, I still think you'll be doing a lot of transactions, which would slow down the entire process.
Since you're doing so many transactions, you may run into the storage scalability targets (throttling). You may want to check into that.
Yet another limitation is the maximum size of a message. Currently a maximum of 64 KB of data can be stored in a single message. What would happen if your row size is larger than that?
I would actually recommend throwing blob storage into the mix. What you could do is read a chunk of data from the SQL table (say 10,000 or 100,000 records) and save that data in blob storage as a file. Depending on how you want to put the data in table storage, you could store the data in CSV, JSON or XML format (XML if you need to preserve data types). Once the file is written to blob storage, you write a message to the queue containing the URI of the blob you've just written. Your worker role (processor) continuously polls this queue, gets a message, fetches the file from blob storage and processes it. Once the worker role has processed the file, it can simply delete the file and the message.
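A minimal sketch of that blob-plus-queue handoff using the current Azure.Storage.Blobs and Azure.Storage.Queues SDKs; the connection string, container and queue names, and file names are placeholders, the chunked SQL export and the table-storage writes are omitted, and for simplicity the queue message carries the blob name rather than the full URI:

using System;
using Azure.Storage.Blobs;
using Azure.Storage.Queues;

// --- Producer: upload one exported chunk and enqueue a pointer to it ---
var blobs = new BlobContainerClient("<connection-string>", "export-chunks");
var queue = new QueueClient("<connection-string>", "chunks-to-process");
blobs.CreateIfNotExists();
queue.CreateIfNotExists();

var chunkBlob = blobs.GetBlobClient($"export-{DateTime.UtcNow:yyyyMMdd}-0001.csv");
chunkBlob.Upload("chunk-0001.csv", overwrite: true);   // CSV written from the SQL data reader
queue.SendMessage(chunkBlob.Name);

// --- Consumer (worker role): poll the queue, fetch the blob, process, clean up ---
var message = queue.ReceiveMessage();
if (message.Value != null)
{
    string blobName = message.Value.Body.ToString();
    var download = blobs.GetBlobClient(blobName).DownloadContent();
    // ... parse download.Value.Content and write the rows to table storage here ...
    blobs.DeleteBlobIfExists(blobName);
    queue.DeleteMessage(message.Value.MessageId, message.Value.PopReceipt);
}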
