How to Index Blob Storage in Azure?

How do we index blob storage? Is there a .NET SDK available for this? If so, I am not able to find it. All I can see are the REST API calls one has to make to create an index and indexers.
Thanks

Blob storage is not indexable by itself. What you need to do is use the Azure Search service and pull the data from blob storage into an Azure Search index; that makes the blob storage data searchable.
To pull the data from Azure Blob Storage into an Azure Search index, you need to create a blob data source and an indexer. The indexer is responsible for fetching the data from the blobs and populating the index.
You may find this link useful for indexing blob storage using Azure Search: https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage.
There's a .NET SDK available for managing Azure Search service indexes, data sources and indexers. You can read more about it here: https://learn.microsoft.com/en-us/dotnet/api/overview/azure/search?view=azure-dotnet. The Azure Search team has also published samples on GitHub that make use of this SDK; you can find them here: https://github.com/Azure-Samples/search-dotnet-getting-started.
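If you prefer Python, the same flow can be sketched with the azure-search-documents package. This is only a minimal illustration: the endpoint, API key, container name and index fields below are placeholders, and a real blob index is usually keyed on the metadata_storage_path property via a field mapping.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchFieldDataType,
    SearchIndexerDataSourceConnection, SearchIndexerDataContainer, SearchIndexer,
)

endpoint = "https://<your-search-service>.search.windows.net"  # placeholder
credential = AzureKeyCredential("<admin-api-key>")              # placeholder

# 1. Create the target index: a key field plus a full-text searchable field.
index_client = SearchIndexClient(endpoint, credential)
index_client.create_index(SearchIndex(
    name="blob-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content"),
    ],
))

# 2. Create a data source that points at the blob container.
indexer_client = SearchIndexerClient(endpoint, credential)
indexer_client.create_data_source_connection(SearchIndexerDataSourceConnection(
    name="blob-datasource",
    type="azureblob",
    connection_string="<storage-connection-string>",            # placeholder
    container=SearchIndexerDataContainer(name="my-container"),
))

# 3. Create and run the indexer that pulls blob content into the index.
indexer_client.create_indexer(SearchIndexer(
    name="blob-indexer",
    data_source_name="blob-datasource",
    target_index_name="blob-index",
))
indexer_client.run_indexer("blob-indexer")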

BLOB INDEX feature in preview May 2020
Announced 4 May 2020. As with other preview features, it is not recommended to deploy it in a production environment.
Worth noting:
The key Microsoft disclaimers to using a product or feature in preview status are as follows:
All previews are excluded from Microsoft SLAs and Warranties
Previews might not include customer support from Microsoft
Previews might not be brought forward into general availability
In my opinion this is very likely to make it to GA: it is clearly needed, and similar features are available on other platforms.
Note: To populate the blob index, you define key-value tag attributes on your data, either on new data during upload or on existing data already in your storage account (GPv2 storage accounts only).
Posted on 4 May, 2020
Blob Index—a managed secondary index, allowing you to store multi-dimensional object attributes to describe your data objects for Azure Blob storage—is now available in preview. Built on top of blob storage, Blob Index offers consistent reliability, availability, and performance for all your workloads. Blob Index provides native object management and filtering capabilities, which allows you to categorize and find data based on attribute tags set on the data.
Manage and find data with Blob Index
As datasets get larger, finding specific related objects in a sea of data can be difficult and frustrating. Previously, clients used the ListBlobs API to retrieve 5,000 lexicographically ordered records at a time, parse through the list, and repeat until they found the blobs they wanted. Some users also resorted to managing a separate lookup table to find specific objects. These separate tables can get out of sync, increasing cost, complexity, and frustration. Customers should not have to worry about data organization or index table management, and should instead focus on building powerful applications to grow their business.
Blob Index alleviates the data management and querying problem with support for all blob types (Block Blob, Append Blob, and Page Blob). Blob Index is exposed through a familiar blob storage endpoint and APIs, allowing you to easily store and access both your data and classification indices on the same service to reduce application complexity.
To populate the blob index, you define key-value tag attributes on your data, either on new data during upload or on existing data already in your storage account. These blob index tags are stored alongside your underlying blob data. The blob indexing engine then automatically reads the new tags, indexes them, and exposes them in a user-queryable blob index. Using the Azure portal, REST APIs, or SDKs, you can then issue a FindBlobsByTags API call specifying a set of criteria. Blob storage will return a filtered result set consisting only of the blobs that meet the match criteria.
The below scenario is an example of how Blob Index works:
In a storage account container with a million blobs, a user uploads a new blob “B2” with the following blob index tags: < Status = Unprocessed, Quality = 8K, Source = RAW >.
The blob and its blob index tags are persisted to the storage account and the account indexing engine exposes the new blob index shortly after.
Later on, an encoding application wants to find all unprocessed media files that are at least 4K resolution quality. It issues a FindBlobsByTags API call to find all blobs that match the following criteria: < Status = Unprocessed AND Quality >= 4K AND Source = RAW >.
The blob index quickly returns just blob “B2,” the sole blob out of one million blobs that matches the specified criteria. The encoding application can quickly start its processing job, saving idle compute time and money.
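As a rough sketch of that scenario with the Python SDK (azure-storage-blob); the connection string, container name and tag values below are placeholders, and note that tag values are compared as strings:
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
container = service.get_container_client("media")                                  # placeholder

# Upload blob "B2" together with its blob index tags.
container.upload_blob(
    name="B2",
    data=b"...raw media bytes...",
    tags={"Status": "Unprocessed", "Quality": "8K", "Source": "RAW"},
)

# Later: find the unprocessed, at-least-4K, RAW blobs across the whole account.
for match in service.find_blobs_by_tags(
        "\"Status\" = 'Unprocessed' AND \"Quality\" >= '4K' AND \"Source\" = 'RAW'"):
    print(match.name)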
Platform feature integrations with Blob Index
Blob Index not only helps you categorize, manage, and find your blob data but also provides integrations with other Blob service features, such as Lifecycle management.
Using the new blobIndexMatch as a filter, you can move data to cooler tiers or delete data based on the tags applied to your blobs. This allows you to be more granular in your rules and only move or delete data if it matches your specified criteria.
The following sample lifecycle management policy applies to block blobs in the “videofiles” container and tiers objects to archive storage after one day, but only if the blobs match the blob index tags Status = ‘Processed’ and Source = ‘RAW’.
(The original post shows the corresponding lifecycle management rule with blobIndexMatch as an image.)
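For illustration, such a rule might look roughly like the following, built here as a Python dict and written out as JSON; the rule name is made up and the exact schema should be checked against the lifecycle management documentation.
import json

# Hypothetical rule: archive block blobs under "videofiles/" one day after
# modification, but only when the blob index tags match
# Status == 'Processed' AND Source == 'RAW'.
policy = {
    "rules": [{
        "enabled": True,
        "name": "archive-processed-raw-videos",
        "type": "Lifecycle",
        "definition": {
            "actions": {
                "baseBlob": {"tierToArchive": {"daysAfterModificationGreaterThan": 1}}
            },
            "filters": {
                "blobTypes": ["blockBlob"],
                "prefixMatch": ["videofiles/"],
                "blobIndexMatch": [
                    {"name": "Status", "op": "==", "value": "Processed"},
                    {"name": "Source", "op": "==", "value": "RAW"},
                ],
            },
        },
    }]
}

with open("policy.json", "w") as f:
    json.dump(policy, f, indent=2)

# The saved policy could then be applied with, for example:
#   az storage account management-policy create --account-name <acct> --resource-group <rg> --policy @policy.json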
Lifecycle management integration with Blob Index is just the beginning. We will be adding more integrations with other blob platform features soon!
Conditional blob operations with Blob Index tags
In REST versions 2019-10-10 and higher, most blob service APIs now support a new conditional header, x-ms-if-tags, so that the operation only succeeds if the specified blob index tag condition is met. If the condition is not met, the operation fails and the blob is not modified. This Blob Index functionality can help ensure that data operations only occur on explicitly tagged blobs and can protect against inadvertent deletion or modification by multi-threaded applications.
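In the Python SDK this condition surfaces as the if_tags_match_condition keyword on many blob operations; a minimal sketch, with placeholder names:
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    "<storage-connection-string>",   # placeholder
    container_name="media",
    blob_name="B2",
)

# Delete the blob only if its index tags say it has already been processed;
# otherwise the service rejects the request and the blob is left untouched.
blob.delete_blob(if_tags_match_condition="\"Status\" = 'Processed'")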
How to get started
To enroll in the Blob Index preview, submit a request to register the feature with your subscription by running the following PowerShell or CLI commands:
Register by using PowerShell
Register-AzProviderFeature -FeatureName BlobIndex -ProviderNamespace Microsoft.Storage
Register-AzResourceProvider -ProviderNamespace Microsoft.Storage
Register by using Azure CLI
az feature register --namespace Microsoft.Storage --name BlobIndex
az provider register --namespace 'Microsoft.Storage'
After your request is approved, any existing or new General-purpose v2 (GPv2) storage accounts in France Central and France South can leverage Blob Index’s capabilities. As with most previews, we recommend that this feature should not be used for production workloads until it reaches general availability.
Ref:
https://azure.microsoft.com/en-gb/blog/manage-and-find-data-with-blob-index-for-azure-storage-now-in-preview/

Related

Azure Blob Storage : Virtual Folder structure vs Blob Index Tags

I am trying to figure out what the benefit of index tags is versus creating a full virtual folder tree structure in Azure blob storage, given that I have full programmatic control over the creation of the blobs.
Virtual Folder structure vs Blob Index Tags
You're asking us to compare two separate features of Azure Blob Storage as though they were mutually exclusive, when in fact they can be used together, and there are more options for organizing blobs than just those two:
TL;DR:
Azure Blob Index Tags - arbitrary mutable tags on your blobs.
Virtual folder structure - this is just a naming convention where your blobs are named with slash-separated "directory" names.
NFS 3.0 Blob Storage and Data Lake Storage Gen2 - this is a major new version (or revision) of Azure Blob Storage that makes it behave almost exactly like a traditional disk file-system (hence the NFS 3.0-compliance) however it (currently) comes with major shortcomings.
In detail:
Azure Blob Index Tags is a recently introduced new feature to Azure Blob Storage: it entered preview in May 2020 and left the preview-stage in June 2021 (2 months ago at the time of writing).
Your storage account needs to be "General Purpose v2", so if you have an older-style storage account you'll need to upgrade it.
Advantages:
It's built-in to Azure Blob Storage, so you don't need to maintain your own indexing infrastructure (which is what we used to have to do: I stored my own blob index in a table in Azure Table Storage in the same storage account, and had a process that ran on a disposable Azure VM nightly to index new blobs).
As it's a tagging system it means you can have your own taxonomy and don't have to force your nomenclature into a single hierarchy as with virtual folders.
Tags are mutable: you can add/remove/edit them as you like.
Disadvantages:
As with maintaining your own blob index, the index updates are not instantaneous (unlike an RDBMS, where indexes are always up-to-date). The linked blog article handwaves this away by saying:
"...the account indexing engine exposes the new blob index shortly after."
...note that they don't define what "shortly" means.
As of August 2021, Azure charges $0.03 per 10,000 tags (regardless of the storage-tier in use). So if you have 1,000,000 blobs and 3 tags per blob, then that's $9/mo.
This isn't a significant cost by any means, but the cost-per-information-theoretic-unit is kinda-high, which is disappointing.
"Virtual Folder tree structure" - By this I assume you mean giving your blob's hierarchical naming system and using Azure Blob Storage's blob-name-prefix search filter.
Advantages:
Tried-and-tested. Simple.
Doesn't cost you anything.
No indexing delay.
Disadvantages:
It's still as slow as enumerating blobs lexicographically.
You cannot conceptually move or rename blobs.
(You can, technically, provided source and destination are in the same container, by doing a copy + delete (see the sketch after this list), and the copy should be near-instantaneous, as I understand Blob Storage uses copy-on-write for same-container copies. But it's still imperfect: the client API exposes it as an asynchronous operation with an unbounded time-to-copy rather than giving hard guarantees.)
The fact this has been a limitation of Azure Blob Storage for a decade now utterly confounds me.
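A sketch of that copy + delete workaround with the Python SDK (blob names are placeholders). Because source and destination are in the same storage account, the copy is authorised by the same account key, but it is still an asynchronous operation that has to be polled:
import time
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
container = service.get_container_client("videofiles")                             # placeholder

source = container.get_blob_client("videos/2020/raw/clip01.mp4")
target = container.get_blob_client("videos/2020/processed/clip01.mp4")

# "Rename" = copy to the new name, wait for the copy to settle, delete the original.
target.start_copy_from_url(source.url)
props = target.get_blob_properties()
while props.copy.status == "pending":
    time.sleep(1)
    props = target.get_blob_properties()
if props.copy.status == "success":
    source.delete_blob()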
NFS 3.0 Blob Storage - Also new in 2020/2021 with Blob Index Tags is NFS 3.0 Blob Storage, which gives you a full "real" hierarchical filesystem for your blobs.
The Hierarchical Namespace feature is powered by Azure Data Lake Storage Gen 2. I don't know any technical details of this so I can't say anything.
Advantages:
NFS 3.0-compliant (that's huge!) so Linux clients can even mount it directly.
It's cheaper than normal blob storage (whaaaaat?!):
In West US 2, NFS+LRS+Hot is $0.018/GB while the old-school flat namespace with LRS+Hot is $0.0184/GB.
In other Azure locations and with other redundancy options then NFS can be slightly more expensive, but otherwise they're generally within $0.01 of each other.
Disadvantages:
Apparently you're limited to only block-blobs: not page-blobs or append-blobs.
Notes from the Known Issues page:
NFS can only be used with new accounts: you cannot update an existing account. You also cannot disable it once you enable it.
You cannot (currently) lock blobs/files - though this looks to come in a future version.
You cannot use both Blob Index Tags and NFS in the same storage account - or in fact most features of Blob Storage (ooo-er!).
The documentation for operations exclusive to hierarchical-namespace blobs only lists Set Blob Expiry; there (still) doesn't seem to be a synchronous/atomic "move blob" or "rename blob" operation. Instead, the Protocol Support page implies that an operation to rename an NFS file is translated into raw blob storage operations behind the scenes, so I'm curious how they do that atomically.
When your application makes a request by using the NFS 3.0 protocol, that request is translated into a combination of block blob operations. For example, NFS 3.0 read Remote Procedure Call (RPC) requests are translated into a Get Blob operation. NFS 3.0 write RPC requests are translated into a combination of Get Block List, Put Block, and Put Block List.
Alternative concept: Content-addressable-storage
Because blobs cannot be atomically/synchronously renamed, a few years ago I simply gave up trying to come up with a perfect blob nomenclature that would stand the test of time, because business requirements always change.
Instead, I noticed that my blobs were invariably immutable: once they've been uploaded to storage they're never updated, or when they are updated they're saved to new, separate blobs - which means that a content-addressable naming strategy suited my projects perfectly.
In short: give your immutable blobs a name which is a string-representation of a hash of their content, and store their hashes in a traditional RDBMS where you have much greater flexibility (and ideally: performance) with how they're indexed and referenced by the rest of your system.
In my case, I set my blobs' names to the Base-16 representation of their SHA-256 hash.
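A minimal sketch of that content-addressable scheme with the Python SDK (the container name is a placeholder):
import hashlib
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
container = service.get_container_client("cas")                                    # placeholder

def put_immutable(data: bytes) -> str:
    """Store an immutable blob under the Base-16 SHA-256 of its content."""
    name = hashlib.sha256(data).hexdigest()
    blob = container.get_blob_client(name)
    if not blob.exists():              # identical content already stored: de-duping for free
        blob.upload_blob(data)
    return name                        # record this hash in your RDBMS

def get_and_verify(name: str) -> bytes:
    """Download a blob and check its content still matches its name."""
    data = container.get_blob_client(name).download_blob().readall()
    if hashlib.sha256(data).hexdigest() != name:
        raise ValueError("content/hash mismatch: blob corrupted or tampered with")
    return data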
Advantages:
You get de-duping for free: blobs with identical content will have identical hashes, so you can avoid uploading/downloading the same huge blob twice.
You get integrity checks for free: if you download a blob and its hash doesn't match its blob name, then the blob was corrupted or your storage account was likely tampered with.
Disadvantages:
You still need to maintain your own index in your RDBMS (if applicable) - but you can still use Blob Index Tags with content-addressable storage if you like.

CreatedBy/LastModifiedBy information for a Blob in Azure Storage Container

I am trying to process some blobs in an Azure Storage container. Our business users upload CSV files to a blob container. The task is to process these files and persist the data in staging tables in an Azure SQL DB for them to analyse later. This involves dynamically creating tables that match the file structure of the CSV files. I have got this part working correctly. I am using Python to accomplish this part of the task.
The next part of the task is to notify the user (who uploaded the blob) via an email once the blob has been processed in the DB by providing them with the table name corresponding to the blob. Ideally, I should also be able to set the permissions in the DB by giving read permissions to the user only on the table corresponding to the blob he uploaded.
To accomplish this, I thought I'd read the blob's owner or last-modified-by attributes from the blob properties and use that information for the notification/DB permissions. But I am not able to find any such property among the blob properties. I tried using diagnostic logging at the storage-account level, but the logs also don't show any information about created by or modified by.
Can someone please guide me how can I go about getting this working?
As the information about who created or last modified a blob is not available as a system property, you will need to come up with your own implementation. I can think of a couple of solutions (without using an external database to store this information):
Store this information as blob metadata: each blob can have custom metadata. You can store this information in the blob's metadata by creating two keys, CreatedBy and LastModifiedBy, and storing the appropriate information there. Please note that blob metadata is not queryable and is also very easy to overwrite. This is by far the easiest approach I could think of (a minimal sketch follows after this list).
Make use of x-ms-client-request-id: With each request to Azure Storage, you can pass a custom value in x-ms-client-request-id request header. If storage analytics is enabled, this information gets logged. You could then query analytics data to find this information. However, it is extremely cumbersome to find this information in analytics logs as the information is saved as a line item in a blob in $logs container. To find this information, you would first need to find appropriate blob containing this information. Then you would need to download the blob, find the appropriate log entry and extract this information.
Considering that neither solution is perfect, I would recommend that you go with saving this information in an external database. It would be much simpler to accomplish your goal that way.
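If you do go with the metadata option, a minimal sketch in Python looks like this; the metadata keys, file names and the way you determine the uploading user are all your own choices:
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
blob = service.get_blob_client(container="uploads", blob="sales-2020.csv")         # placeholders

# At upload time, record who created the blob.
with open("sales-2020.csv", "rb") as f:
    blob.upload_blob(f, metadata={"CreatedBy": "jane.doe", "LastModifiedBy": "jane.doe"})

# Later, the processing job reads the metadata back to decide who to notify
# and which DB user should get read permission on the staging table.
props = blob.get_blob_properties()
print(props.metadata.get("CreatedBy"), props.metadata.get("LastModifiedBy"))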
Blobs in Azure support custom metadata as a dictionary of key/value pairs you can save for each file, but in my experience it's not handy in all cases, especially because you cannot query over the metadata without reading each blob (Azure will charge you for that), not to mention the network transfer.
from:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-properties-metadata
Objects in Azure Storage support system properties and user-defined metadata, in addition to the data they contain.
System properties: System properties exist on each storage resource. Some of them can be read or set, while others are read-only. Under the covers, some system properties correspond to certain standard HTTP headers. The Azure storage client library maintains these for you.
User-defined metadata: User-defined metadata is metadata that you specify on a given resource in the form of a name-value pair. You can use metadata to store additional values with a storage resource. These additional metadata values are for your own purposes only, and do not affect how the resource behaves.
I had something very similar to do once, and to avoid creating and connecting an external database I just created a table in Table Storage to save each file's URL from blob storage, together with the properties you need (user permissions), in an unstructured way.
You might find it extremely straightforward to query information from the table with Python (I did it with .NET), but I found it's pretty much the same.
https://learn.microsoft.com/en-us/azure/cosmos-db/table-storage-how-to-use-python
Azure Table storage and Azure Cosmos DB are services that store structured NoSQL data in the cloud, providing a key/attribute store with a schemaless design. Because Table storage and Azure Cosmos DB are schemaless, it's easy to adapt your data as the needs of your application evolve. Access to Table storage and Table API data is fast and cost-effective for many types of applications, and is typically lower in cost than traditional SQL for similar volumes of data.
Example code for filtering:
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity

table_service = TableService(connection_string='DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey;TableEndpoint=myendpoint;')
tasks = table_service.query_entities('tasktable', filter="PartitionKey eq 'tasksSeattle'")
for task in tasks:
    print(task.description)
    print(task.priority)
So you only need to create the table and use the account keys from Azure to connect to it.
Hope it helps you.

Is Azure Blob storage the right place to store many (small) communication logs?

I am working with a program that connects to multiple APIs; the logs for each operation (HTML/XML/JSON) need to be stored for possible later review. Is it feasible to store each request/reply in an Azure blob? There can be hundreds of requests per second (all of which need storing), varying in size with an average size of around 100 kB.
Because the logs need to be searchable (by metadata), my plan is to store them in Azure Blob storage and put the metadata (blob locations, custom application-related request and content identifiers, etc.) in an easily searchable database.
You can store logs in Azure Table storage or Blob storage, but Microsoft itself recommends using Blob storage; Azure Storage Analytics stores its own log data in Blob storage.
The 'Azure Storage Table Design Guide' points out several drawbacks of using Table storage for logs and also provides details on how to use Blob storage to store them. Read the 'Log data anti-pattern' section in particular for this use case.
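As a rough sketch of the blob-storage side (names, prefixes and metadata keys below are made up; the searchable metadata itself would still go into your database as you plan):
import uuid
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient, ContentSettings

service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
logs = service.get_container_client("api-logs")                                    # placeholder

def store_log(payload: bytes, api_name: str, request_id: str) -> str:
    """Write one request/reply log; return the blob name to record in the database."""
    now = datetime.now(timezone.utc)
    name = f"{api_name}/{now:%Y/%m/%d/%H}/{request_id}-{uuid.uuid4()}.json"
    logs.upload_blob(
        name,
        payload,
        content_settings=ContentSettings(content_type="application/json"),
        metadata={"api": api_name, "requestId": request_id},
    )
    return name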

Determining the size of table and blob in Azure Storage Service

I would like to determine the size of each table and of each blob in my Azure storage account.
I tried using the $MetricsTransactionsBlob, $MetricsTransactionsTable and $MetricsTransactionsQueue tables and the storage service analytics, but I am not able to determine the size of a given table or a given blob.
I have also tried downloading Cerebrata Azure Management Studio, but I am unable to determine the size of a table or blob with it.
Can someone share sample code which can help me to determine the size?
As far as finding the size of a table in Azure Table Storage goes, there's nothing available out of the box to the best of my knowledge. This is something you would need to do on your own: fetch all entities from the table and calculate the size of each entity. You may find this blog post useful: http://blogs.msdn.com/b/avkashchauhan/archive/2011/11/30/how-the-size-of-an-entity-is-caclulated-in-windows-azure-table-storage.aspx
Coming to blob containers, storage analytics provides this capability to some extent. You can find the total size of blob storage using the $MetricsCapacityBlob table. More information about it here: http://msdn.microsoft.com/en-us/library/windowsazure/hh343264.aspx.
You mentioned that you're using Azure Management Studio from Cerebrata. AMS has support for storage analytics, so you can explore the $MetricsCapacityBlob table using the tool itself. You can also find the size of a blob container by right-clicking on the container in question and clicking the Storage Statistics context menu item. Another alternative from Cerebrata is their Azure Management Cmdlets product (http://www.cerebrata.com/products/azure-management-cmdlets/features); you can use the Get-BlobContainerSize cmdlet to find the size and total number of blobs in a blob container.
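If you prefer code over the tools, a simple (if slow for very large containers) approach with the current Python SDK is to enumerate the blobs and sum their sizes; the container name is a placeholder:
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
container = service.get_container_client("my-container")                           # placeholder

total_bytes = 0
blob_count = 0
for blob in container.list_blobs():
    total_bytes += blob.size       # BlobProperties.size is the content length in bytes
    blob_count += 1

print(f"{blob_count} blobs, {total_bytes / (1024 ** 3):.2f} GiB")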

Azure blob storage - auto generate unique blob name

I am writing a small web application for Windows Azure, which should use the blob storage for, obviously, storing blobs.
Is there a function or a way to automatically generate a unique name for a blob on insert?
You can use a Guid for that:
string blobName = Guid.NewGuid().ToString();
There is nothing that generates a unique name "on insert"; you need to come up with the name ahead of time.
When choosing the name of your blob, be careful about using any algorithm that generates a sequential number of some kind (either at the beginning or the end of the blob name). Azure Storage relies on the name for load balancing; using sequential values can create contention when accessing/writing blobs because it can prevent Azure from properly load-balancing its storage. You get 60 MB/sec on each node (i.e. server), so to ensure proper load balancing and to leverage 60 MB/sec on multiple storage nodes, you need to use well-distributed (for example, random) names for your blobs. I typically use GUIDs to avoid this problem, just as Sandrino recommends.
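The same idea in Python, keeping the original file name recoverable through metadata rather than through the blob name (container and file names are placeholders):
import uuid
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
container = service.get_container_client("uploads")                                # placeholder

original_name = "report.pdf"
blob_name = str(uuid.uuid4())      # well-distributed name, no sequential prefix/suffix

with open(original_name, "rb") as f:
    container.upload_blob(blob_name, f, metadata={"originalName": original_name})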
In addition to what Sandrino said (GUIDs have a very low probability of being duplicated), you can consider third-party libraries which generate conflict-free identifiers, for example the Flake ID Generator.
EDIT
Herve has pointed out a very valid Azure Blob Storage consideration that applies to any blob naming scheme, namely Azure Storage load balancing and blob partitioning.
Azure keeps all blobs on partition servers. Which partition server is used to store a particular blob is decided based on the blob container and the blob file name. Unfortunately, I was not able to find documentation describing the algorithm used for blob partitioning.
More on Azure Blob architecture can be found on Windows Azure Storage Architecture Overview article.
This is an old post, but it was the first hit that showed up for 'azure storage blob unique id' search.
It looks like the 'metadata_storage_path' generated property is unique, but it's not filterable so it may not be useful for some purposes.
