Fast mechanism for querying Azure blob names - azure

I'm trying to get a list of blob names in Azure and I'm looking for ways to make this operation significantly faster. Within a given sub-folder, the number of blobs can exceed 150,000 elements. The filenames of the blobs are an encoded ID which is what I really need to get at, but I could store that as some sort of metadata if there was a way to query just the metadata or a single field of the metadata.
I'm finding that something as simple as the following:
var blobList = container.ListBlobs(null, false);
can take upwards of 60 seconds to run from my desktop and typically around 15 seconds when running on a VM hosted in Azure. These times are based on a test of 125k blobs in an otherwise empty container, measured several hours after the blobs were uploaded, so they've definitely had time to "settle", so to speak.
I've attempted multiple variations and tried using ListBlobsSegmented but it doesn't really help because the function is returning a lot of extra information that I simply don't need. I just need the blob names so I can get at the encoded ID to see what's currently stored and what isn't.
The query for the blob names and extracting the encoded Id is somewhat time sensitive so if I could get it to under 1 second, I'd be happy with it. If I stored the files locally, I can get the entire list of files in a few ms, but I have to use Azure storage for this so that's not an option.
The only thing I can think of to be able to reduce the time it takes to identify the available blobs is to track the names of the blobs being added or removed from a given folder and store it in a separate blob. Then when I need to know the blob names in that folder, I would read the blob with the metadata rather than using ListBlobs. I suppose another would be to use Azure Table storage in a similar way, but it seems like I'm being forced into caching information about a given folder in the container.
Is there a better way of doing this or is this generally what people end up doing when you have hundreds of thousands of blobs in a single folder?

As mentioned, Azure Blob storage is a storage system and does not index its content for you. There is now an Azure Search indexer that indexes content uploaded to Azure Blob storage; see https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/ for details. With it you get the features Azure Search supports, e.g. listing, searching, paging and sorting. Hope this helps.
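The caching idea raised in the question (tracking blob names in a separate index so that a full listing is not needed) could look roughly like the sketch below. It is a minimal in-memory illustration in Python, not Azure SDK code: the `FolderIndex` class, its method names, and the assumed name format `<folder>/<encodedId>.<ext>` are all hypothetical, and storing the serialized index in a blob or a table is left out.

```python
import json


class FolderIndex:
    """Hypothetical per-folder index of encoded IDs, maintained next to the blobs."""

    def __init__(self, ids=None):
        self.ids = set(ids or [])

    @staticmethod
    def encoded_id(blob_name):
        # Assumed name format: "<folder>/<encodedId>.<ext>"
        file_name = blob_name.rsplit("/", 1)[-1]
        return file_name.split(".", 1)[0]

    def on_upload(self, blob_name):
        # Call whenever a blob is added to the folder
        self.ids.add(self.encoded_id(blob_name))

    def on_delete(self, blob_name):
        # Call whenever a blob is removed from the folder
        self.ids.discard(self.encoded_id(blob_name))

    def to_json(self):
        # Serialized form that could be stored as one small blob
        return json.dumps(sorted(self.ids))

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text))


index = FolderIndex()
index.on_upload("reports/a1b2c3.dat")
index.on_upload("reports/d4e5f6.dat")
index.on_delete("reports/a1b2c3.dat")
restored = FolderIndex.from_json(index.to_json())
```

Reading one small index blob replaces the listing call, at the cost of having to keep the index in step with every upload and delete.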

Related

Best way to index data in Azure Blob Storage?

I plan on using Azure Blob storage to store images. I will have around 5000 categories for images that I plan on using folders to keep separated. For each of the image files, the file names won't differ a lot across the board and there is the potential to need to change metadata frequently.
My original plan was to use a SQL database to index all of these files and store my metadata there, but I'm second guessing that plan.
Is it feasible to index files in Azure Blob storage using a database, or should I just stick with using blob metadata?
Edit: I guess this question should really be "are there any downsides to indexing Azure Blob storage using a relational database?". I'm much more comfortable working with a DB than I am Azure storage, so my preference is to use a DB.
I'm second guessing whether or not to use a DB after looking at Azure storage more and discovering meta-tags and indexing.
You can use Azure Search for this task as well: store the images in Azure Storage (blobs) and use Azure Search for crawling, indexing and searching. Using metadata, you can enhance your search as well. This way you might not even need folders to separate the categories.
Blob Index is a very feasible option, and it can save on cost, time, and overhead because you do not need a SQL database.
https://azure.microsoft.com/en-gb/blog/manage-and-find-data-with-blob-index-for-azure-storage-now-in-preview/
If you are looking for more information on this preview feature, I would love to hear more and work more closely on this issue. Could you please reach me at BlobIndexPreview@microsoft.com?

Storing file created date in Azure File Storage

I am storing a series of Excel files in an Azure File Storage container for my company. My manager wants to be able to see the file created date for these files, as we will be running monthly downloads of the same reports. Is there a way to automate a means of storing the created date as one of the properties in Azure, or adding a bit of custom metadata, perhaps? Thanks in advance.
You can certainly store the created date as part of custom metadata for the file. However, there are certain things you would need to be aware of:
Metadata is editable: Anybody with access to the storage account can edit the metadata. They can change the created date metadata value or even delete that information.
Querying is painful: Azure File Storage doesn't provide querying capability, so if you want to query on a file's created date, it is going to be a painful process. First you would need to list all files in a share and then fetch the metadata for each file separately. Depending on the number of files and the level of nesting, it could be a complicated process.
There are some alternatives available to you:
Use Blob Storage
If you can use Blob Storage instead of File Storage, use that. Blob Storage has a system-defined property for the created date, so you don't have to do anything special. However, like File Storage, Blob Storage also has an issue with querying, but it is comparatively less painful.
Use Table Storage/SQL Database For Reporting
For querying purposes, you can store the file's created date in either Azure Table Storage or SQL Database. The downside of this approach is that because it is a completely separate system, it would be your responsibility to keep the data in sync. For example, if a file is deleted, you will need to ensure that the corresponding entry in the database is also removed.
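A rough sketch of the separate reporting store is shown below. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for Azure SQL Database or Table Storage; the table name, column names, and helper functions are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the real reporting database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (path TEXT PRIMARY KEY, created_utc TEXT NOT NULL)"
)


def record_upload(path, created_utc):
    # Call right after a file is uploaded to storage
    conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, created_utc))


def record_delete(path):
    # Call right after a file is deleted, so the index stays in sync
    conn.execute("DELETE FROM files WHERE path = ?", (path,))


def created_between(start_utc, end_utc):
    # ISO 8601 UTC strings of the same format sort chronologically as text
    rows = conn.execute(
        "SELECT path FROM files WHERE created_utc >= ? AND created_utc < ? "
        "ORDER BY path",
        (start_utc, end_utc),
    )
    return [row[0] for row in rows]


record_upload("reports/sales-jan.xlsx", "2017-01-03T09:00:00Z")
record_upload("reports/sales-feb.xlsx", "2017-02-02T09:00:00Z")
record_delete("reports/sales-jan.xlsx")
```

The `record_delete` call is the part that is easy to forget: if it is skipped, the report keeps showing files that no longer exist.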

How to archive Azure blob storage content?

I need to store some temporary files for maybe 1 to 3 months. Only the last three months of files need to be kept; older files need to be deleted. How can I do this in Azure Blob storage? Is there any other option in this case other than blob storage?
IMHO the best option for storing files in Azure is either Blob Storage or File Storage; however, neither supports auto-expiration of content (based on age or some other criteria).
This feature was requested for Blob Storage a long time ago, but unfortunately no progress has been made so far (https://feedback.azure.com/forums/217298-storage/suggestions/7010724-support-expiration-auto-deletion-of-blobs).
You could, however, write something of your own to achieve this. It's quite simple: periodically (say once a day) your program fetches the list of blobs and compares each blob's last modified date with the current date. If the last modified date is older than the desired period (1 or 3 months, as you mentioned), you simply delete the blob.
You can use WebJobs, Azure Functions or Azure Automation to schedule your code to run on a periodic basis. In fact, there's ready-made code available to you if you want to use Azure Automation Service: https://gallery.technet.microsoft.com/scriptcenter/Remove-Storage-Blobs-that-aae4b761.
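The selection step of such a periodic clean-up can be sketched as follows. This is an illustration in Python under stated assumptions: the listing is modeled as plain `(name, last_modified)` pairs rather than real SDK objects, `expired_blobs` is a hypothetical helper, and the actual delete call is left out.

```python
from datetime import datetime, timedelta, timezone


def expired_blobs(blobs, now, max_age_days=90):
    """Return the names of blobs last modified before the retention cutoff.

    `blobs` is an iterable of (name, last_modified) pairs.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, last_modified in blobs if last_modified < cutoff]


now = datetime(2017, 1, 1, tzinfo=timezone.utc)
listing = [
    ("temp/old.bin", datetime(2016, 9, 1, tzinfo=timezone.utc)),
    ("temp/new.bin", datetime(2016, 12, 15, tzinfo=timezone.utc)),
]
to_delete = expired_blobs(listing, now)  # the scheduled job would delete these
```

Passing `now` in as a parameter, instead of reading the clock inside the function, keeps the retention rule easy to test.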
As far as I know, Azure Blob storage is an appropriate place to store temporary files. For your scenario, I assume there is no built-in option for deleting old files, so you need to delete your temporary files programmatically or manually.
A simple way is to upload your blobs (files) using a specific naming format (e.g. https://<your-storagename>.blob.core.windows.net/containerName/2016-11/fileName or https://<your-storagename>.blob.core.windows.net/2016-11/fileName), then manage the files manually via Microsoft Azure Storage Explorer.
Also, you could check your files and delete the old ones before uploading a new temporary file. For more details, you could follow storage-blob-dotnet-store-temp-files and override the method CleanStorageIfReachLimit to implement your own logic for deleting blobs (files).
Additionally, you could leverage a scheduled Azure WebJob to clean your blobs (files).
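The month-based naming format mentioned above makes it easy to work out which prefixes have aged out. The sketch below is a Python illustration only; `month_prefix`, `blob_name_for` and `expired_prefixes` are hypothetical helpers, and the rule that the current month counts as one of the months to keep is an assumption.

```python
from datetime import date


def month_prefix(d):
    # "2016-11" style prefix, as in the naming format above
    return f"{d.year:04d}-{d.month:02d}"


def blob_name_for(file_name, uploaded):
    return f"{month_prefix(uploaded)}/{file_name}"


def expired_prefixes(prefixes, today, keep_months=3):
    """Return the prefixes older than the last `keep_months` months.

    The current month counts as one of the months to keep.
    """
    months = today.year * 12 + (today.month - 1) - (keep_months - 1)
    oldest_kept = f"{months // 12:04d}-{months % 12 + 1:02d}"
    # Zero-padded "yyyy-MM" strings sort chronologically
    return sorted(p for p in prefixes if p < oldest_kept)


name = blob_name_for("fileName", date(2016, 11, 5))
old = expired_prefixes(
    ["2016-09", "2016-10", "2016-11", "2016-12", "2017-01"],
    today=date(2017, 1, 15),
)
```

With this layout a whole month can be removed by deleting everything under one prefix, without inspecting each blob's date.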
You can use Azure Cool Blob Storage.
Its storage cost is lower than that of the Hot tier, which makes it more suitable for archives.
You can store your less frequently accessed data in the Cool access tier at a low storage cost (as low as $0.01 per GB in some regions), and your more frequently accessed data in the Hot access tier at a lower access cost.
Here is a document that explains its features:
https://azure.microsoft.com/en-us/blog/introducing-azure-cool-storage/

Azure Data Factory Only Retrieve New Blob files from Blob Storage

I am currently copying blob files from an Azure Blob storage to an Azure SQL Database. It is scheduled to run every 15 minutes but each time it runs it repeatedly imports all blob files. I would rather like to configure it so that it only imports if any new files have arrived into the Blob storage. One thing to note is that the files do not have a date time stamp. All files are present in a single blob container. New files are added to the same blob container. Do you know how to configure this?
I'd preface this answer by saying that a change in your approach may be warranted...
Given what you've described, you're fairly limited in options. One approach is to have your scheduled job keep track of what it has already stored in the SQL database. You loop over all the items within the container and check whether each one has been processed yet.
The container has a ListBlobs method that would work for this. Reference: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/
foreach (var item in container.ListBlobs(null, true))
{
    // Check if it has already been processed or not
}
Note that the number of blobs in the container may be an issue with this approach. If it is too large consider creating a new container per hour/day/week/etc to hold the blobs, assuming you can control this.
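The "has it been processed yet" check can be sketched as a simple set lookup. This is a Python illustration with assumed names: `unprocessed` is a hypothetical helper, the listing is a plain list of blob names, and the `processed` set stands in for whatever the job stores in the SQL database.

```python
def unprocessed(blob_names, processed):
    """Return listed blob names not yet in the processed set, in listing order."""
    return [name for name in blob_names if name not in processed]


processed = {"in/a.csv", "in/b.csv"}  # would be loaded from the SQL database
listing = ["in/a.csv", "in/b.csv", "in/c.csv"]

for name in unprocessed(listing, processed):
    # Import the blob here, then record it so the next run skips it
    processed.add(name)
```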
Please use CloudBlobContainer.ListBlobs(null, true, BlobListingDetails.Metadata) and check CloudBlob.Properties.LastModified for each listed blob.
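An alternative to remembering every processed name is to remember only the newest last-modified time seen so far (a "watermark"). The Python sketch below assumes the listing is available as `(name, last_modified)` pairs; `new_since` is a hypothetical helper, not an SDK call.

```python
from datetime import datetime, timezone


def new_since(blobs, watermark):
    """Return (names, new_watermark) for blobs modified after `watermark`.

    `blobs` is an iterable of (name, last_modified) pairs.
    """
    fresh = [(name, ts) for name, ts in blobs if ts > watermark]
    new_watermark = max((ts for _, ts in fresh), default=watermark)
    return [name for name, _ in fresh], new_watermark


listing = [
    ("in/a.csv", datetime(2016, 5, 3, 5, 0, tzinfo=timezone.utc)),
    ("in/c.csv", datetime(2016, 5, 3, 5, 20, tzinfo=timezone.utc)),
]
last_run = datetime(2016, 5, 3, 5, 15, tzinfo=timezone.utc)
names, last_run = new_since(listing, last_run)
```

One caveat of this sketch: a blob written with exactly the same timestamp as the stored watermark would be skipped, so a real job may want to combine it with a check against already-imported names.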
Instead of a copy activity, I would use a custom DotNet activity within Azure Data Factory and use the Blob Storage API (some of the answers here have described the use of this API) and Azure SQL API to perform your copy of only the new files.
However, with time, your blob location will have a lot of files, so, expect that your job will start taking longer and longer (after a point taking longer than 15 minutes) as it would iterate through each file every time.
Can you explain your scenario further? Is there a reason you want to add data to the SQL tables every 15 minutes? Can you increase that to copy data every hour? Also, how is this data getting into Blob Storage? Is another Azure service putting it there or is it an external application? If it is another service, consider moving it straight into Azure SQL and cut out the Blob Storage.
Another suggestion would be to create folders for the 15 minute intervals like hhmm. So, for example, a sample folder would be called '0515'. You could even have a parent folder for the year, month and day. This way you can insert the data into these folders in Blob Storage. Data Factory is capable of reading date and time folders and identifying new files that come into the date/time folders.
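Computing the folder for a given timestamp is straightforward. The sketch below is in Python and assumes a `yyyy/MM/dd/hhmm` layout with a 24-hour clock; `interval_folder` is a hypothetical helper and the exact layout is a choice, not something Data Factory prescribes.

```python
from datetime import datetime


def interval_folder(ts, minutes=15):
    """Return the 'yyyy/MM/dd/hhmm' folder for the interval containing `ts`."""
    # Round the minute down to the start of its interval
    floored = ts.replace(
        minute=ts.minute - ts.minute % minutes, second=0, microsecond=0
    )
    return floored.strftime("%Y/%m/%d/%H%M")


folder = interval_folder(datetime(2016, 5, 3, 5, 22, 41))
```

Here a file arriving at 05:22:41 lands in the `0515` folder, matching the sample folder name above.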
I hope this helps! If you can provide some more information about your problem, I'd be happy to help you further.

Azure blob storage - auto generate unique blob name

I am writing a small web application for Windows Azure, which should use the blob storage for, obviously, storing blobs.
Is there a function or a way to automatically generate a unique name for a blob on insert?
You can use a Guid for that:
string blobName = Guid.NewGuid().ToString();
There is nothing that generates a unique name "on insert"; you need to come up with the name ahead of time.
When choosing the name of your blob, be careful when using any algorithm that generates a sequential number of some kind (either at the beginning or the end of the name of a blob). Azure Storage relies on the name for load balancing; using sequential values can create contention when accessing or writing Azure blobs because it can prevent Azure from properly load-balancing its storage. You get 60MB/Sec on each node (i.e. server). So to ensure proper load balancing and to leverage 60MB/Sec on multiple storage nodes, you need to use random names for your blobs. I typically use Guids to avoid this problem, just as Sandrino is recommending.
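For comparison, the same random-name approach in Python might look like the sketch below. `unique_blob_name` is a hypothetical helper, and keeping the original file extension is an assumption added for convenience rather than something the answers require.

```python
import uuid
from pathlib import PurePosixPath


def unique_blob_name(original_file_name):
    """Return a random (version 4) UUID name that keeps the original extension."""
    suffix = PurePosixPath(original_file_name).suffix
    return f"{uuid.uuid4()}{suffix}"


blob_name = unique_blob_name("photo.jpg")
```

Because the UUID is random rather than sequential, consecutive uploads do not share a common name prefix.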
In addition to what Sandrino said (using GUIDs, which have a very low probability of being duplicated), you can consider third-party libraries that generate conflict-free identifiers, for example Flake ID Generator.
EDIT
Herve has pointed out a very valid Azure Blob feature that should be considered with any blob name, namely Azure Storage load balancing and blob partitioning.
Azure keeps all blobs on partition servers. Which partition server stores a particular blob is decided based on the blob container and the blob file name. Unfortunately, I was not able to find any documentation describing the algorithm used for blob partitioning.
More on the Azure Blob architecture can be found in the Windows Azure Storage Architecture Overview article.
This is an old post, but it was the first hit that showed up for 'azure storage blob unique id' search.
It looks like the 'metadata_storage_path' generated property is unique, but it's not filterable so it may not be useful for some purposes.

Resources