Search blob storage data - azure

I have a large amount of diagnostics data stored in an Azure Blob Storage. Is there any way I can get that data searchable from my Azure SQL database? I would like to join on some custom data fields in my blob stored data.

Blob storage doesn't have searchable metadata, per se: you can search containers for given blob names, and you can enumerate blobs to look at their metadata. But aside from the container/blob URI, there are no built-in search mechanisms.
If you want to search for metadata, you'll need to build your own data store (e.g. in a searchable database such as SQL Database, the one you mentioned). This would be completely up to your app to do: you'd need to extract the specific data you want to search and store it in your database engine of choice. You'd then need to link your database engine's contents back to blob storage (e.g. store a blob's URL alongside its metadata).
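As a rough illustration of the "own data store" idea, here is a minimal sketch that keeps a blob's URL next to the fields extracted from it. It uses Python's built-in sqlite3 as a stand-in for SQL Database so it runs anywhere; the table name, column names, and sample URLs are all hypothetical.

```python
import sqlite3

# sqlite3 stands in for Azure SQL Database so the sketch runs anywhere.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE blob_index (
        blob_url  TEXT PRIMARY KEY,  -- link back to blob storage
        device_id TEXT NOT NULL,     -- custom field extracted from the blob
        severity  TEXT NOT NULL      -- another extracted field
    )
    """
)


def index_blob(blob_url, fields):
    """Store the searchable fields for one blob next to its URL."""
    conn.execute(
        "INSERT OR REPLACE INTO blob_index (blob_url, device_id, severity) "
        "VALUES (?, ?, ?)",
        (blob_url, fields["device_id"], fields["severity"]),
    )


def find_blobs(device_id):
    """Return the URLs of all blobs recorded for one device, sorted."""
    rows = conn.execute(
        "SELECT blob_url FROM blob_index WHERE device_id = ? ORDER BY blob_url",
        (device_id,),
    )
    return [url for (url,) in rows]


# Sample rows; in a real app the fields would be extracted from each blob.
base = "https://myaccount.blob.core.windows.net/diagnostics/"
index_blob(base + "a.log", {"device_id": "dev-1", "severity": "error"})
index_blob(base + "b.log", {"device_id": "dev-2", "severity": "info"})
index_blob(base + "c.log", {"device_id": "dev-1", "severity": "info"})

print(find_blobs("dev-1"))
```

Once the rows live in a relational table, joining them against other tables is ordinary SQL, and the stored URL is what you use to fetch the blob itself.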
If you're talking about full-text search, you'd need to employ an appropriate fts tool. Azure provides Azure Search as a 1st-party full-text-search service, or you may certainly use a 3rd-party tool or service. What you choose is completely up to you.

Related

Best way to index data in Azure Blob Storage?

I plan on using Azure Blob storage to store images. I will have around 5000 categories for images that I plan on using folders to keep separated. For each of the image files, the file names won't differ a lot across the board and there is the potential to need to change metadata frequently.
My original plan was to use a SQL database to index all of these files and store my metadata there, but I'm second guessing that plan.
Is it feasible to index files in Azure Blob storage using a database, or should I just stick with using blob metadata?
Edit: I guess this question should really be "are there any downsides to indexing Azure Blob storage using a relational database?". I'm much more comfortable working with a DB than I am Azure storage, so my preference is to use a DB.
I'm second guessing whether or not to use a DB after looking at Azure storage more and discovering meta-tags and indexing. Hope this helps.
You can use Azure Search for this task as well: store the images in Azure Storage (blobs) and use Azure Search for crawling, indexing, and searching. Using metadata, you can enhance your search as well. This way you might not even need to use folders to separate the different categories.
Blob Index is a very feasible option, and it can save on pricing, time, and overhead by not using SQL.
https://azure.microsoft.com/en-gb/blog/manage-and-find-data-with-blob-index-for-azure-storage-now-in-preview/
If you are looking for more information on this preview feature, I would love to hear more and work more closely with you on this issue. Please reach me at BlobIndexPreview#microsoft.com.
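To give a feel for how index tags are queried, here is a minimal sketch that builds a tag filter expression of the kind the blob index accepts. The helper name and the sample tag and container names are hypothetical; in the azure-storage-blob v12 SDK the resulting string would be passed to `BlobServiceClient.find_blobs_by_tags`, which is not called here so that the sketch runs without an Azure account.

```python
def build_tag_filter(tags, container=None):
    """Build a blob index tag filter such as: "category" = 'cats'.

    tags: dict of tag name -> required value (matched with equality).
    container: optional container name to restrict the search to.
    """
    clauses = []
    if container is not None:
        clauses.append(f"@container = '{container}'")
    # Sort the keys so the expression is deterministic.
    for key in sorted(tags):
        clauses.append(f"\"{key}\" = '{tags[key]}'")
    return " AND ".join(clauses)


expression = build_tag_filter({"category": "cats"}, container="images")
print(expression)
```

With tags like these, the 5000 categories from the question could be expressed as a tag per blob rather than as a folder hierarchy.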

CreatedBy/LastModifiedBy information for a Blob in Azure Storage Container

I am trying to process some blobs in Azure Storage container. Our business users upload csv files to a blob container. The task is to process these files and persist the data in staging tables in Azure SQL DB for them to analyse later. This involves creating tables dynamically matching the file structure of the csv files. I have got this part working correctly. I am using python to accomplish this part of the task.
The next part of the task is to notify the user (who uploaded the blob) via an email once the blob has been processed in the DB by providing them with the table name corresponding to the blob. Ideally, I should also be able to set the permissions in the DB by giving read permissions to the user only on the table corresponding to the blob he uploaded.
To accomplish this, I thought I'll read the blob owner or last modified by attributes from the blob property and use that information for notification/db permissions. But I am not able to find any such property in blob properties. I tried using Diagnostic Logging at Storage account level but the logs also don't show any information about created by or modified by.
Can someone please guide me how can I go about getting this working?
As the information about who created/last modified a blob is not available as a system property, you will need to come up with your own implementation. I can think of a few solutions for that (without using an external database to store this information):
Store this information as blob's metadata: Each blob can have custom metadata. You can store this information in blob's metadata by creating two keys: CreatedBy and LastModifiedBy and store appropriate information. Please note that blob's metadata is not queryable and also it is very easy to overwrite the metadata. This is by far the easiest approach I could think of.
Make use of x-ms-client-request-id: With each request to Azure Storage, you can pass a custom value in x-ms-client-request-id request header. If storage analytics is enabled, this information gets logged. You could then query analytics data to find this information. However, it is extremely cumbersome to find this information in analytics logs as the information is saved as a line item in a blob in $logs container. To find this information, you would first need to find appropriate blob containing this information. Then you would need to download the blob, find the appropriate log entry and extract this information.
Considering neither of these solutions is perfect, I would recommend that you save this information in an external database. It would be much simpler to accomplish your goal that way.
Blobs in Azure support custom metadata as a dictionary of key/value pairs that you can save for each file, but in my experience it's not handy in all cases, especially because you cannot query over the metadata without reading each blob (Azure will charge you for that), not to mention the network transfer.
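For the metadata option described above, a minimal sketch might look like this. It assumes a client object shaped like `BlobClient` from the azure-storage-blob v12 SDK, using only its `get_blob_properties()` and `set_blob_metadata()` methods; the helper name `stamp_uploader` is hypothetical. Setting metadata replaces the whole dictionary, so the sketch reads the existing values first, which is exactly the "easy to overwrite" risk mentioned above.

```python
def stamp_uploader(blob_client, user):
    """Record who created / last modified a blob in its user-defined metadata.

    blob_client is assumed to behave like azure.storage.blob.BlobClient:
    get_blob_properties() returns an object with a .metadata dict, and
    set_blob_metadata(metadata) replaces all metadata on the blob.
    """
    # Read current metadata first, because the set call overwrites everything.
    metadata = dict(blob_client.get_blob_properties().metadata or {})
    # CreatedBy is written once; LastModifiedBy is updated on every call.
    metadata.setdefault("CreatedBy", user)
    metadata["LastModifiedBy"] = user
    blob_client.set_blob_metadata(metadata)
    return metadata
```

The application that performs the upload has to call this itself, since the storage service does not record the uploading user on its own.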
from:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-properties-metadata
Objects in Azure Storage support system properties and user-defined
metadata, in addition to the data they contain.
System properties: System properties exist on each storage resource.
Some of them can be read or set, while others are read-only. Under the
covers, some system properties correspond to certain standard HTTP
headers. The Azure storage client library maintains these for you.
User-defined metadata: User-defined metadata is metadata that you
specify on a given resource in the form of a name-value pair. You can
use metadata to store additional values with a storage resource. These
additional metadata values are for your own purposes only, and do not
affect how the resource behaves.
I had something very similar to do once, and to avoid creating and connecting to an external database, I just created a table in the storage account to save each file URL from blob storage along with all the properties you need (user permissions) in an unstructured way.
You might find it extremely straightforward to query information from the table with Python (I did it with .NET, but I found it's pretty much the same).
https://learn.microsoft.com/en-us/azure/cosmos-db/table-storage-how-to-use-python
Azure Table storage and Azure Cosmos DB are services that store
structured NoSQL data in the cloud, providing a key/attribute store
with a schemaless design. Because Table storage and Azure Cosmos DB
are schemaless, it's easy to adapt your data as the needs of your
application evolve. Access to Table storage and Table API data is fast
and cost-effective for many types of applications, and is typically
lower in cost than traditional SQL for similar volumes of data.
Example code for filtering:
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity

# Connect with the storage account's connection string (placeholder values).
table_service = TableService(connection_string='DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey;TableEndpoint=myendpoint;')

# Fetch only the entities in one partition.
tasks = table_service.query_entities('tasktable', filter="PartitionKey eq 'tasksSeattle'")
for task in tasks:
    print(task.description)
    print(task.priority)
So you only need to create the table and use the keys from Azure to connect to it.
Hope it helps you.

Storing Documents on Azure with custom metadata

I am trying to find the best way to implement a small site that allows the user to upload a file and then search on it.
I used Azure Search with blob storage.
The file is stored in blob storage and then gets indexed by the Azure Search indexer - so far so good.
The problem is that I would like to add some custom data to each document, like a file ID and other business data. This data is not part of the document. Is there a way to achieve this?
Someone suggested I use Cosmos DB, though I am not sure it's the best way to go when it comes to documents.
Thanks
If you would like to keep using blob storage, you can store metadata with the blobs - just add custom metadata to your blobs, add corresponding fields to the search index, and the blob indexer will pick up the metadata.
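As a rough sketch of the "add corresponding fields to the search index" step, the snippet below builds an index definition in the JSON shape used by the Azure Search REST API, with one field per custom metadata key. The index name, the `id` key field, and the metadata keys `FileId` and `Department` are hypothetical, and the indexer configuration (including how the document key is mapped) is not shown. It assumes, as the answer says, that the blob indexer fills a field when the blob carries metadata with the same name.

```python
def metadata_fields(metadata_keys):
    """Return one index field definition per custom blob metadata key."""
    return [
        {
            "name": key,             # must match the blob metadata key
            "type": "Edm.String",
            "searchable": False,
            "filterable": True,      # allows filtering on the business data
            "retrievable": True,
        }
        for key in metadata_keys
    ]


index_definition = {
    "name": "documents",  # hypothetical index name
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        # "content" receives the text extracted from the blob.
        {"name": "content", "type": "Edm.String", "searchable": True},
    ]
    + metadata_fields(["FileId", "Department"]),
}

print([field["name"] for field in index_definition["fields"]])
```

The business data then travels with the blob itself, so no second store such as Cosmos DB is needed just to hold it.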

Query blobs in Blob storage

I have serialized text data that is stored in a blob inside Azure blob storage. The text is basically key/value data. I am wondering if there is a way to easily query the blob without exploding the data into another table/database or pulling the blob into memory?
Azure Blob storage has no API to query data within the blob - it's just dumb storage. See here for the Blob Storage API. You're essentially stuck reading, deserializing and grabbing your value(s).
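The "read, deserialize, grab your value" step can be sketched as follows. The question does not say how the key/value text is serialized, so this assumes one `key=value` pair per line in UTF-8; the function name is hypothetical. In practice the bytes could come from something like `BlobClient.download_blob().readall()` in the azure-storage-blob v12 SDK, which is not called here so that the sketch runs offline.

```python
def parse_key_values(raw: bytes) -> dict:
    """Parse blob content made of 'key=value' lines into a dict.

    Blank lines and lines without '=' are skipped; whitespace around
    keys and values is trimmed. The line format is an assumption.
    """
    result = {}
    for line in raw.decode("utf-8").splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


sample = b"host=web-01\nregion = westeurope\n\nquery=a=b\n"
settings = parse_key_values(sample)
print(settings["region"])
```

This still downloads the entire blob for every lookup, which is the cost the answer is pointing at.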
Perhaps Azure table storage would be a better fit for this application? That at least keeps things in the realm of an Azure storage account rather than needing to pull in a SQL Server instance.
One option you could consider is to use Data Lake Analytics, as it supports Azure Blobs as data source.
Depending on what your preferred way of accessing the data is, you can use PowerShell, .NET SDK etc. to query the data...

Document Db Usage on Azure

In our project, we are trying to store big files, like images, in our NoSQL database. While doing some research, we heard about the DocumentDB NoSQL database on the Azure cloud platform. Additionally, we will store our data in Azure.
What is the best way to store big files on the Azure platform?
Is DocumentDB good enough?
Is it appropriate to use MongoDB in Azure?
Though DocumentDB allows you to store files (they are stored as attachments), I would not recommend using it. Here are my reasons:
In the current version, the maximum size of an attachment is 2 MB.
You can't really stream attachments. You would need to first read the attachment contents in your application and stream it from there.
For storing files in Azure, I would highly recommend that you use Blob Storage. It is meant for that purpose only. Maximum size of a file that you can store in blob storage is 1 TB (which I would assume would be more than sufficient for you) and each storage account could hold up to 500 TB of data. Furthermore you can directly stream files to your end users.
DocumentDB does allow you to add files to your documents, called attachments. The advantage of using that feature is that the storage of your attachment is tied to the lifecycle of your document: if you delete the document, then the attachment will be deleted as well.
As DocDB is still in preview, you can expect the aforementioned limitations to be different once the service goes into General Availability.
You do not have to store the attachments in DocDB itself. DocDB allows you to do that, or to simply store a reference to your file as part of the attachment's metadata. This is useful if you want to store your file somewhere else, but need a reference to its location within the DocDB document. From the documentation:
DocumentDB allows you to store binary blobs/media either with DocumentDB or to your own remote media store. It also allows you to represent the metadata of a media in terms of a special document called attachment. An attachment in DocumentDB is a special (JSON) document which references the media/blob stored elsewhere. An attachment is simply a special document which captures the metadata (e.g. location, author etc.) of a media stored in a remote media storage.
If you need to store really big files into Azure (like video files or big engineering files), your best (and cheapest) option is probably to store your data in a Block Blob. You can then take the Uri of the blob and store it as an attachment metadata within your docdb document.
The preferable mechanism for storing files is Blob Storage, as it is cheap. Keep in mind that DocumentDB is really expensive, so even to store the file's path in Blob Storage, use Azure Tables.

Resources