Query images in object storage by metadata - azure

I have over 10 GB of images for my ecommerce app. I'm thinking of moving them to object storage (S3, Azure, Google, etc.).
That would give me the opportunity to attach custom data as metadata (like in NoSQL). For example, I have an image and corresponding metadata: product_id, sku, tags.
I want to query my images by metadata. For example, get all images from my object storage where meta_key = 'tag' and tag = 'nature'.
So the object storage should have indexing capabilities. I do not want to iterate over billions of images to find only one of them.
I'm new to Amazon AWS, Azure, Google, and OpenStack. I know that Amazon S3 is able to store metadata, but it doesn't have indexes (like Apache Solr).
Which service is best suited to querying files/objects by custom metadata?

To do this in AWS, your best bet is to pair the object store (S3) with a traditional database that stores the metadata for easy querying.
Depending on your needs, DynamoDB or RDS (in the flavor of your choice) would be two AWS technologies to consider for metadata storage and retrieval.
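A minimal sketch of that pairing, assuming Python. SQLite stands in for RDS or DynamoDB so the example is self-contained; the table layout, the helper names (create_index, index_image, find_keys_by_tag), and the object keys are all hypothetical. The images themselves stay in S3; only the object key and its metadata go into the database.

```python
import sqlite3

def create_index(conn):
    """Create a small metadata index: one row per image, one row per (image, tag)."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS images (
            object_key TEXT PRIMARY KEY,   -- key of the object in the bucket
            product_id INTEGER,
            sku        TEXT
        );
        CREATE TABLE IF NOT EXISTS image_tags (
            object_key TEXT REFERENCES images(object_key),
            tag        TEXT,
            PRIMARY KEY (object_key, tag)
        );
        CREATE INDEX IF NOT EXISTS idx_image_tags_tag ON image_tags(tag);
    """)

def index_image(conn, object_key, product_id, sku, tags):
    """Record (or refresh) the metadata for one stored object."""
    conn.execute("INSERT OR REPLACE INTO images VALUES (?, ?, ?)",
                 (object_key, product_id, sku))
    conn.execute("DELETE FROM image_tags WHERE object_key = ?", (object_key,))
    conn.executemany("INSERT INTO image_tags VALUES (?, ?)",
                     [(object_key, tag) for tag in set(tags)])
    conn.commit()

def find_keys_by_tag(conn, tag):
    """Return the object keys carrying the given tag, using the tag index."""
    rows = conn.execute(
        "SELECT object_key FROM image_tags WHERE tag = ? ORDER BY object_key", (tag,))
    return [row[0] for row in rows]

# Hypothetical data: in a real app, index_image would run right after the upload to S3.
conn = sqlite3.connect(":memory:")
create_index(conn)
index_image(conn, "products/1001/front.jpg", 1001, "SKU-1001", ["nature", "outdoor"])
index_image(conn, "products/1002/front.jpg", 1002, "SKU-1002", ["studio"])
print(find_keys_by_tag(conn, "nature"))
```

The keys returned by the query are then used to fetch the objects from the bucket, so no listing or scanning of the bucket is needed.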

Related

Best way to index data in Azure Blob Storage?

I plan on using Azure Blob storage to store images. I will have around 5000 categories for images that I plan on using folders to keep separated. For each of the image files, the file names won't differ a lot across the board and there is the potential to need to change metadata frequently.
My original plan was to use a SQL database to index all of these files and store my metadata there, but I'm second guessing that plan.
Is it feasible to index files in Azure Blob storage using a database, or should I just stick with using blob metadata?
Edit: I guess this question should really be "are there any downsides to indexing Azure Blob storage using a relational database?". I'm much more comfortable working with a DB than I am with Azure storage, so my preference is to use a DB.
I'm second guessing whether or not to use a DB after looking at Azure storage more and discovering meta-tags and indexing. Hope this helps.
You can use Azure Search for this task as well: store the images in Azure Storage (blobs) and use Azure Search for crawling, indexing, and searching. Using metadata, you can enhance your search as well. This way you might not even need to use folders to separate the different categories.
Blob Index is a very feasible option, and it can save on cost, time, and overhead because you do not need to use SQL.
https://azure.microsoft.com/en-gb/blog/manage-and-find-data-with-blob-index-for-azure-storage-now-in-preview/
If you are looking for more information on this preview feature, I would love to hear more and work more closely with you on this issue. Could you please reach me at BlobIndexPreview#microsoft.com.
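As a rough sketch of what a Blob Index query could look like from Python: the helper below only builds the filter expression, using the syntax described in the Blob Index documentation as I understand it (tag keys in double quotes, values in single quotes, an optional @container clause). The function name build_tag_filter and the tag and container names are hypothetical, and the call to the storage service is shown only as a comment because it needs a real account.

```python
def build_tag_filter(tags, container=None):
    """Build a blob index tag filter, e.g. "tag" = 'nature' AND "sku" = 'A-1'."""
    clauses = []
    if container is not None:
        # Restrict the search to a single container.
        clauses.append(f"@container = '{container}'")
    for key, value in tags.items():
        if "'" in value or '"' in key:
            raise ValueError("quotes are not allowed in tag keys or values")
        clauses.append(f"\"{key}\" = '{value}'")
    return " AND ".join(clauses)

expression = build_tag_filter({"tag": "nature"}, container="images")
print(expression)

# Assumption: with the azure-storage-blob package and a real account, the expression
# would be passed to BlobServiceClient.find_blobs_by_tags, roughly like this:
#
#   from azure.storage.blob import BlobServiceClient
#   service = BlobServiceClient.from_connection_string(connection_string)
#   for blob in service.find_blobs_by_tags(filter_expression=expression):
#       print(blob.name)
```

Note that index tags are separate from ordinary blob metadata: only tags are indexed and queryable this way.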

How to query different S3-compatible object storage by Prestosql

Background
PrestoSQL works well with data on S3 and S3-compatible object storage (e.g., IBM Cloud Object Storage) when using the s3a:// URI prefix and an S3 configuration with a single HMAC key pair, set via hive.s3.aws-access-key and hive.s3.aws-secret-key as described in the PrestoSQL guide Amazon S3 Configuration - Hive Connector.
Question
When data is served from two different buckets across two cloud accounts, a client has to use two different HMAC key pairs to access the objects. Does that mean it has to configure two catalogs via the Hive connector in PrestoSQL?
This is a common case on IBM Cloud, where object storage services are managed as instances under different cloud accounts.
Yes, you need to configure two separate hive catalogs.
Alternatively, you could use client-provided extra credentials (this is supported for GCS now, but could easily be extended to S3-compatible storage).
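To illustrate the two-catalog setup, here is a small Python sketch that renders one catalog properties file per account. The property names are the ones from the Hive connector's S3 configuration; the catalog names, metastore URI, endpoints, and keys are placeholders, and the helper render_hive_catalog is hypothetical.

```python
def render_hive_catalog(metastore_uri, s3_endpoint, access_key, secret_key):
    """Render the contents of one Hive catalog properties file."""
    lines = [
        "connector.name=hive-hadoop2",
        f"hive.metastore.uri={metastore_uri}",
        f"hive.s3.endpoint={s3_endpoint}",
        f"hive.s3.aws-access-key={access_key}",
        f"hive.s3.aws-secret-key={secret_key}",
    ]
    return "\n".join(lines) + "\n"

# One catalog per cloud account, each with its own HMAC key pair (placeholder values).
catalogs = {
    "account_a": render_hive_catalog(
        "thrift://metastore.example.com:9083", "https://s3.region-a.example.com",
        "KEY_A", "SECRET_A"),
    "account_b": render_hive_catalog(
        "thrift://metastore.example.com:9083", "https://s3.region-b.example.com",
        "KEY_B", "SECRET_B"),
}

for name, body in catalogs.items():
    # Each body would be saved as etc/catalog/<name>.properties on the Presto nodes.
    print(f"# etc/catalog/{name}.properties")
    print(body)
```

Queries then address each account through its catalog name, for example account_a.some_schema.some_table and account_b.some_schema.some_table.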

CreatedBy/LastModifiedBy information for a Blob in Azure Storage Container

I am trying to process some blobs in Azure Storage container. Our business users upload csv files to a blob container. The task is to process these files and persist the data in staging tables in Azure SQL DB for them to analyse later. This involves creating tables dynamically matching the file structure of the csv files. I have got this part working correctly. I am using python to accomplish this part of the task.
The next part of the task is to notify the user (who uploaded the blob) via an email once the blob has been processed in the DB by providing them with the table name corresponding to the blob. Ideally, I should also be able to set the permissions in the DB by giving read permissions to the user only on the table corresponding to the blob he uploaded.
To accomplish this, I thought I'll read the blob owner or last modified by attributes from the blob property and use that information for notification/db permissions. But I am not able to find any such property in blob properties. I tried using Diagnostic Logging at Storage account level but the logs also don't show any information about created by or modified by.
Can someone please guide me on how to get this working?
As the information about who created/last modified a blob is not available as a system property, you will need to come up with your own implementation. I can think of a few solutions for that (without using an external database to store this information):
Store this information as blob's metadata: Each blob can have custom metadata. You can store this information in blob's metadata by creating two keys: CreatedBy and LastModifiedBy and store appropriate information. Please note that blob's metadata is not queryable and also it is very easy to overwrite the metadata. This is by far the easiest approach I could think of.
Make use of x-ms-client-request-id: With each request to Azure Storage, you can pass a custom value in x-ms-client-request-id request header. If storage analytics is enabled, this information gets logged. You could then query analytics data to find this information. However, it is extremely cumbersome to find this information in analytics logs as the information is saved as a line item in a blob in $logs container. To find this information, you would first need to find appropriate blob containing this information. Then you would need to download the blob, find the appropriate log entry and extract this information.
Considering that neither of these solutions is perfect, I would recommend that you save this information in an external database. It would be much simpler to accomplish your goal that way.
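For the metadata approach, a minimal Python sketch could look like the following. It assumes the uploading application already knows who the user is (storage itself does not), and that blob_client behaves like azure.storage.blob.BlobClient, whose upload_blob accepts overwrite and metadata arguments. The helper names audit_metadata and upload_with_audit are hypothetical.

```python
def audit_metadata(user, existing=None):
    """Return blob metadata with CreatedBy preserved and LastModifiedBy set to `user`."""
    metadata = dict(existing or {})
    metadata.setdefault("CreatedBy", user)   # only set on the first upload
    metadata["LastModifiedBy"] = user        # always reflects the latest writer
    return metadata

def upload_with_audit(blob_client, data, user, existing=None):
    """Upload `data` and attach the audit metadata in the same request.

    Assumption: blob_client is an azure.storage.blob.BlobClient (or anything with a
    compatible upload_blob method). `existing` is the blob's current metadata, if any.
    """
    metadata = audit_metadata(user, existing)
    blob_client.upload_blob(data, overwrite=True, metadata=metadata)
    return metadata

# The pure helper can be tried without a storage account:
first = audit_metadata("alice@example.com")
second = audit_metadata("bob@example.com", existing=first)
print(second)
```

Because the metadata is replaced on every write, any code path that uploads without going through such a helper will silently drop these keys, which is the overwrite risk mentioned above.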
Blobs in Azure support custom metadata as a dictionary of key/value pairs that you can save for each file, but in my experience it is not handy in all cases, especially because you cannot query over the metadata without reading the blob (Azure will charge you for that), not to mention the network transfer.
from:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-properties-metadata
Objects in Azure Storage support system properties and user-defined
metadata, in addition to the data they contain.
System properties: System properties exist on each storage resource.
Some of them can be read or set, while others are read-only. Under the
covers, some system properties correspond to certain standard HTTP
headers. The Azure storage client library maintains these for you.
User-defined metadata: User-defined metadata is metadata that you
specify on a given resource in the form of a name-value pair. You can
use metadata to store additional values with a storage resource. These
additional metadata values are for your own purposes only, and do not
affect how the resource behaves.
I had something very similar to do once, and to avoid creating and connecting to an external database, I just created a table in the storage account to save each file URL from blob storage along with all the properties I needed (such as user permissions) in an unstructured way.
You might find it extremely straightforward to query information from the table with Python (I did it with .NET, but I found it's pretty much the same).
https://learn.microsoft.com/en-us/azure/cosmos-db/table-storage-how-to-use-python
Azure Table storage and Azure Cosmos DB are services that store
structured NoSQL data in the cloud, providing a key/attribute store
with a schemaless design. Because Table storage and Azure Cosmos DB
are schemaless, it's easy to adapt your data as the needs of your
application evolve. Access to Table storage and Table API data is fast
and cost-effective for many types of applications, and is typically
lower in cost than traditional SQL for similar volumes of data.
Example code for filtering:
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity

# Connect with the storage account's connection string (placeholder values shown).
table_service = TableService(connection_string='DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey;TableEndpoint=myendpoint;')

# Fetch only the entities in the given partition.
tasks = table_service.query_entities('tasktable', filter="PartitionKey eq 'tasksSeattle'")
for task in tasks:
    print(task.description)
    print(task.priority)
So you only need to create the table and use the keys from Azure to connect to it.
Hope it helps you.

Search blob storage data

I have a large amount of diagnostics data stored in an Azure Blob Storage. Is there any way I can get that data searchable from my Azure SQL database? I would like to join on some custom data fields in my blob stored data.
Blob storage doesn't have searchable metadata, per se: you may certainly search containers for given blob names, and you may even enumerate blobs to look at their metadata. But aside from the container/blob URI, there are no built-in search mechanisms.
If you want to search for metadata, you'll need to build your own data store (e.g. in a searchable database such as SQL Database, the one you mentioned). This would be completely up to your app to do (you'd need to extract the specific data you want to search and store it in your database engine of choice). You'd then need to link your database engine's contents back to blob storage (e.g. store a blob's URL alongside its metadata).
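A minimal sketch of that extract-and-link step, assuming Python and assuming the diagnostics blobs are JSON documents (the question does not say). SQLite stands in for Azure SQL Database so the example runs on its own; the table names, field names, blob URL, and the helper index_blob are hypothetical.

```python
import json
import sqlite3

def index_blob(conn, blob_url, payload, fields):
    """Copy selected fields from a JSON blob payload into the searchable table."""
    record = json.loads(payload)
    for name in fields:
        if name in record:
            conn.execute(
                "INSERT INTO blob_fields (blob_url, field, value) VALUES (?, ?, ?)",
                (blob_url, name, str(record[name])),
            )

conn = sqlite3.connect(":memory:")
# Searchable copy of the custom fields, each row pointing back at its blob.
conn.execute("CREATE TABLE blob_fields (blob_url TEXT, field TEXT, value TEXT)")
# An ordinary application table to join against.
conn.execute("CREATE TABLE devices (device_id TEXT PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO devices VALUES ('dev-7', 'ops-team')")

index_blob(
    conn,
    "https://myaccount.blob.core.windows.net/diag/a.json",
    '{"device_id": "dev-7", "level": "error", "dump": "stack trace text"}',
    ["device_id", "level"],
)

# Join the extracted field against application data, returning the blob URL.
rows = conn.execute("""
    SELECT d.owner, f.blob_url
    FROM blob_fields AS f
    JOIN devices AS d ON f.field = 'device_id' AND f.value = d.device_id
""").fetchall()
print(rows)
```

Only the fields you want to search are copied; the full payload stays in blob storage and is fetched through the stored URL when needed.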
If you're talking about full-text search, you'd need to employ an appropriate fts tool. Azure provides Azure Search as a 1st-party full-text-search service, or you may certainly use a 3rd-party tool or service. What you choose is completely up to you.

Document Db Usage on Azure

In our project, we are trying to store big files, like images, in our NoSQL database. While doing some research, we heard of the DocumentDB NoSQL database on the Azure cloud platform. Additionally, we will store our data in Azure.
What is the best way to store big files on the Azure platform?
Is DocumentDB good enough?
Is it appropriate to use MongoDB in Azure?
Though DocumentDB allows you to store files (they are stored as attachments), I would not recommend using it. Here are my reasons:
In the current version, the maximum size of an attachment is 2 MB.
You can't really stream attachments. You would need to first read the attachment contents in your application and stream it from there.
For storing files in Azure, I would highly recommend that you use Blob Storage. It is meant for that purpose only. Maximum size of a file that you can store in blob storage is 1 TB (which I would assume would be more than sufficient for you) and each storage account could hold up to 500 TB of data. Furthermore you can directly stream files to your end users.
DocumentDB does allow you to add files to your documents, called attachments. The advantage of using that feature is that the storage of your attachment is tied to the lifecycle of your document: if you delete the document, then the attachment will be deleted as well.
As DocDB is still in preview, you can expect the aforementioned limitations to be different once the service goes into General Availability.
You do not have to store the attachments in DocDB itself. DocDB allows you to do that, or to simply store a reference to your file as part of the attachment's metadata. This is useful if you want to store your file somewhere else, but need a reference to its location within the DocDB document. From the documentation:
DocumentDB allows you to store binary blobs/media either with DocumentDB or to your own remote media store. It also allows you to represent the metadata of a media in terms of a special document called attachment. An attachment in DocumentDB is a special (JSON) document which references the media/blob stored elsewhere. An attachment is simply a special document which captures the metadata (e.g. location, author etc.) of a media stored in a remote media storage.
If you need to store really big files into Azure (like video files or big engineering files), your best (and cheapest) option is probably to store your data in a Block Blob. You can then take the Uri of the blob and store it as an attachment metadata within your docdb document.
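As an illustration of keeping only a reference in the document, here is a small Python sketch. The document shape, the field names (file, blobUri, contentType), and the blob URL are all hypothetical; this is a plain JSON document carrying a pointer to the blob, not the attachment API itself.

```python
import json

def document_with_blob_reference(doc_id, title, blob_uri, content_type):
    """Build a document that points at a file stored in Blob Storage."""
    return {
        "id": doc_id,
        "title": title,
        # Only the location and type are stored here; the bytes live in the blob.
        "file": {"blobUri": blob_uri, "contentType": content_type},
    }

doc = document_with_blob_reference(
    "video-42",
    "Site walkthrough",
    "https://myaccount.blob.core.windows.net/videos/walkthrough.mp4",
    "video/mp4",
)
print(json.dumps(doc, indent=2))
```

The application reads the document, takes the URI from it, and streams the file directly from Blob Storage.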
The preferable mechanism for storing files is Blob Storage, as it is cheap. Keep in mind that DocumentDB is really expensive, so even for storing the paths of the files in Blob Storage, use Azure Tables.

Resources