In our project, we are trying to store big files, such as images, in our NoSQL database. While doing some research, we heard about the DocumentDB NoSQL database on the Azure cloud platform. We will also be storing our data in Azure.
What is the best way to store big files on the Azure platform?
Is DocumentDB good enough?
Is it appropriate to use MongoDB in Azure?
Though DocumentDB allows you to store files (they are stored as attachments), I would not recommend using it. Here are my reasons:
In the current version, the maximum size of an attachment is 2 MB.
You can't really stream attachments. You would need to first read the attachment contents in your application and stream it from there.
For storing files in Azure, I would highly recommend that you use Blob Storage, which is built for exactly this purpose. The maximum size of a file that you can store in Blob Storage is 1 TB (which I assume is more than sufficient for you), and each storage account can hold up to 500 TB of data. Furthermore, you can stream files directly to your end users.
DocumentDB does allow you to add files to your documents, called attachments. The advantage of using that feature is that the storage of your attachment is tied to the lifecycle of your document: if you delete the document, then the attachment will be deleted as well.
As DocDB is still in preview, you can expect the aforementioned limitations to be different once the service goes into General Availability.
You do not have to store the attachments in DocDB itself. DocDB allows you to do that, or to simply store a reference to your file as part of the attachment's metadata. This is useful if you want to store your file somewhere else, but need a reference to its location within the DocDB document. From the documentation:
DocumentDB allows you to store binary blobs/media either with DocumentDB or to your own remote media store. It also allows you to represent the metadata of a media in terms of a special document called attachment. An attachment in DocumentDB is a special (JSON) document which references the media/blob stored elsewhere. An attachment is simply a special document which captures the metadata (e.g. location, author etc.) of a media stored in a remote media storage.
If you need to store really big files into Azure (like video files or big engineering files), your best (and cheapest) option is probably to store your data in a Block Blob. You can then take the Uri of the blob and store it as an attachment metadata within your docdb document.
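As a rough illustration of the "store the Uri, not the bytes" idea, here is a minimal sketch in plain Python. The field names (`attachment`, `media`, `contentType`), the helper `build_document`, and the account, container and blob names are all assumptions made for the example; it only builds the JSON you would hand to whatever DocumentDB client you use, and does not call any Azure API.

```python
import json


def build_document(doc_id, title, blob_uri, content_type):
    """Return a JSON-serialisable document that points at a blob instead of embedding it."""
    return {
        "id": doc_id,
        "title": title,
        # Only the location and a little metadata live in the document;
        # the bytes themselves stay in Blob Storage.
        "attachment": {
            "media": blob_uri,
            "contentType": content_type,
        },
    }


# Hypothetical storage account, container and blob names.
doc = build_document(
    "video-001",
    "Site walkthrough",
    "https://myaccount.blob.core.windows.net/videos/walkthrough.mp4",
    "video/mp4",
)
print(json.dumps(doc, indent=2))
```

The document stays small no matter how large the file is, and the application follows the stored Uri when it needs the actual content.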
The preferable mechanism for storing files is Blob Storage, as it is cheap. Keep in mind that DocumentDB is really expensive, so even for storing the blob's file path, use Azure Tables instead.
I am storing a series of Excel files in an Azure File Storage container for my company. My manager wants to be able to see the file created date for these files, as we will be running monthly downloads of the same reports. Is there a way to automate a means of storing the created date as one of the properties in Azure, or adding a bit of custom metadata, perhaps? Thanks in advance.
You can certainly store the created date as part of custom metadata for the file. However, there are certain things you would need to be aware of:
Metadata is editable: Anybody with access to the storage account can edit the metadata. They can change the created date metadata value or even delete that information.
Querying is painful: Azure File Storage doesn't provide querying capability so if you want to query on file's created date, it is going to be a painful process. First you would need to list all files in a share and then fetch metadata for each file separately. Depending on the number of files and the level of nesting, it could be a complicated process.
There are some alternatives available to you:
Use Blob Storage
If you can use Blob Storage instead of File Storage, use that. Blob Storage has a system-defined property for the created date, so you don't have to do anything special. However, like File Storage, Blob Storage also has an issue with querying, though it is comparatively less painful.
Use Table Storage/SQL Database For Reporting
For querying purposes, you can store the file's created date in either Azure Table Storage or SQL Database. The downside of this approach is that because it is a completely separate system, it would be your responsibility to keep the data in sync. For example, if a file is deleted, you will need to ensure that entry for the same in the database is also removed.
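To make the "keep it in sync yourself" point concrete, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for Azure SQL Database. The table name `file_index`, the helper functions and the file paths are hypothetical; the point is only that every upload and every delete in storage needs a matching write to the index.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory SQLite stands in for Azure SQL Database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_index (path TEXT PRIMARY KEY, created_utc TEXT NOT NULL)"
)


def record_upload(path, created=None):
    """Call when a file is first uploaded; an existing created date is kept."""
    created = created or datetime.now(timezone.utc)
    conn.execute(
        "INSERT OR IGNORE INTO file_index VALUES (?, ?)",
        (path, created.isoformat()),
    )


def record_delete(path):
    """Call when the file is deleted from storage, so the index does not go stale."""
    conn.execute("DELETE FROM file_index WHERE path = ?", (path,))


def files_created_between(start, end):
    """Return paths whose created date falls in [start, end), given UTC datetimes."""
    rows = conn.execute(
        "SELECT path FROM file_index "
        "WHERE created_utc >= ? AND created_utc < ? ORDER BY path",
        (start.isoformat(), end.isoformat()),
    )
    return [row[0] for row in rows]


record_upload("reports/2024-01/sales.xlsx", datetime(2024, 1, 5, tzinfo=timezone.utc))
record_upload("reports/2024-02/sales.xlsx", datetime(2024, 2, 5, tzinfo=timezone.utc))
record_delete("reports/2024-01/sales.xlsx")

january = files_created_between(
    datetime(2024, 1, 1, tzinfo=timezone.utc), datetime(2024, 2, 1, tzinfo=timezone.utc)
)
february = files_created_between(
    datetime(2024, 2, 1, tzinfo=timezone.utc), datetime(2024, 3, 1, tzinfo=timezone.utc)
)
print(january, february)
```

Because the dates live in a real table, the monthly report becomes a single range query instead of a listing of the whole share.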
I am trying to process some blobs in Azure Storage container. Our business users upload csv files to a blob container. The task is to process these files and persist the data in staging tables in Azure SQL DB for them to analyse later. This involves creating tables dynamically matching the file structure of the csv files. I have got this part working correctly. I am using python to accomplish this part of the task.
The next part of the task is to notify the user (who uploaded the blob) via an email once the blob has been processed in the DB by providing them with the table name corresponding to the blob. Ideally, I should also be able to set the permissions in the DB by giving read permissions to the user only on the table corresponding to the blob he uploaded.
To accomplish this, I thought I'll read the blob owner or last modified by attributes from the blob property and use that information for notification/db permissions. But I am not able to find any such property in blob properties. I tried using Diagnostic Logging at Storage account level but the logs also don't show any information about created by or modified by.
Can someone please guide me on how to get this working?
As the information about who created/last modified a blob is not available as a system property, you will need to come up with your own implementation. I can think of a few solutions for that (without using an external database to store this information):
Store this information as blob's metadata: Each blob can have custom metadata. You can store this information in blob's metadata by creating two keys: CreatedBy and LastModifiedBy and store appropriate information. Please note that blob's metadata is not queryable and also it is very easy to overwrite the metadata. This is by far the easiest approach I could think of.
Make use of x-ms-client-request-id: With each request to Azure Storage, you can pass a custom value in x-ms-client-request-id request header. If storage analytics is enabled, this information gets logged. You could then query analytics data to find this information. However, it is extremely cumbersome to find this information in analytics logs as the information is saved as a line item in a blob in $logs container. To find this information, you would first need to find appropriate blob containing this information. Then you would need to download the blob, find the appropriate log entry and extract this information.
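For the first option, the only real logic is deciding which keys to write. Here is a minimal sketch: `stamp_metadata` is a hypothetical helper, and the resulting dictionary would be passed to whatever "set metadata" call your storage SDK provides (not shown, and not assumed here).

```python
def stamp_metadata(existing, user):
    """Return new metadata with CreatedBy set once and LastModifiedBy always updated."""
    metadata = dict(existing or {})           # copy, so the caller's dict is untouched
    metadata.setdefault("CreatedBy", user)    # only written on the first upload
    metadata["LastModifiedBy"] = user         # overwritten on every change
    return metadata


first = stamp_metadata(None, "alice@example.com")
second = stamp_metadata(first, "bob@example.com")
print(second)
```

Since anyone with write access can overwrite these keys, treat the values as a convenience for notifications rather than as an audit trail.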
Considering that neither of these solutions is perfect, I would recommend saving this information in an external database. It would be much simpler to accomplish your goal that way.
Blobs in Azure support custom metadata as a dictionary of key/value pairs that you can save for each file, but in my experience it is not handy in every case, especially because you cannot query over those values without reading each blob (Azure will charge you for those requests), and that is before you count the network transfer.
from:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-properties-metadata
Objects in Azure Storage support system properties and user-defined
metadata, in addition to the data they contain.
System properties: System properties exist on each storage resource.
Some of them can be read or set, while others are read-only. Under the
covers, some system properties correspond to certain standard HTTP
headers. The Azure storage client library maintains these for you.
User-defined metadata: User-defined metadata is metadata that you
specify on a given resource in the form of a name-value pair. You can
use metadata to store additional values with a storage resource. These
additional metadata values are for your own purposes only, and do not
affect how the resource behaves.
I had something very similar to do once. To avoid creating and connecting to an external database, I just created a table in the storage account and saved each file URL from Blob Storage in it, along with all the properties I needed (such as user permissions), in an unstructured way.
You should find it very straightforward to query information from the table with Python (I did it with .NET, but it is pretty much the same).
https://learn.microsoft.com/en-us/azure/cosmos-db/table-storage-how-to-use-python
Azure Table storage and Azure Cosmos DB are services that store
structured NoSQL data in the cloud, providing a key/attribute store
with a schemaless design. Because Table storage and Azure Cosmos DB
are schemaless, it's easy to adapt your data as the needs of your
application evolve. Access to Table storage and Table API data is fast
and cost-effective for many types of applications, and is typically
lower in cost than traditional SQL for similar volumes of data.
Example code for filtering:
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity

# Connect with the storage account's connection string (placeholder values shown).
table_service = TableService(connection_string='DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey;TableEndpoint=myendpoint;')

# Fetch only the entities in the 'tasksSeattle' partition.
tasks = table_service.query_entities('tasktable', filter="PartitionKey eq 'tasksSeattle'")
for task in tasks:
    print(task.description)
    print(task.priority)
So you only need to create the table and use the keys from Azure to connect to it.
Hope it helps you.
I am trying to find the best way to implement a small site that lets the user upload a file and then search it.
I used Azure Search with Blob Storage.
The file is stored in Blob Storage and then gets indexed by the Azure Search indexer. So far so good.
The problem is that I would like to add some custom data to each document, such as a file ID and other business data. This data is not part of the document itself. Is there a way to achieve this?
Someone suggested I use Cosmos DB, though I am not sure it is the best way to go when it comes to documents.
Thanks
If you would like to keep using blob storage, you can store metadata with the blobs - just add custom metadata to your blobs, add corresponding fields to the search index, and the blob indexer will pick up the metadata.
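As a rough sketch of the "add corresponding fields to the search index" step, the snippet below builds an index definition as a plain dictionary. The overall shape (`name`, `fields`, `Edm.String`, `key`, `searchable`, `filterable`) follows my understanding of the Azure Search REST index schema, so check it against the current documentation; the index name, the `id` and `content` fields, and the metadata keys `file_id` and `department` are hypothetical. The idea is that each custom metadata key on the blob has an index field with the same name.

```python
import json


def build_index_definition(index_name, metadata_fields):
    """Build an index definition whose extra fields mirror the blob metadata keys."""
    fields = [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
    ]
    for name in metadata_fields:
        # One filterable field per custom metadata key set on the blobs.
        fields.append(
            {"name": name, "type": "Edm.String", "filterable": True, "retrievable": True}
        )
    return {"name": index_name, "fields": fields}


index_definition = build_index_definition("uploads-index", ["file_id", "department"])
print(json.dumps(index_definition, indent=2))
```

With matching names in place, the business data travels with the blob and no second data store is needed just to attach an ID to a search result.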
I have a large amount of diagnostics data stored in an Azure Blob Storage. Is there any way I can get that data searchable from my Azure SQL database? I would like to join on some custom data fields in my blob stored data.
Blob storage doesn't have searchable metadata, per se: You may certainly search containers for given blob names, and you may even enumerate blobs to look at their metadata. But aside from the container/blob URI, there are no built-in search mechanisms.
If you want to search for metadata, you'll need to build your own data store (e.g. in a searchable database such as SQL Database, the one you mentioned). This would be completely up to your app to do (you'd need to extract the specific data you want to search, and store it in your database engine of choice). You'd then need to link your database engine's contents back to blob storage (e.g. store a blob's URL alongside its metadata).
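Here is a minimal sketch of that idea, again with Python's built-in `sqlite3` standing in for Azure SQL Database. The tables `devices` and `blob_index`, their columns, and the blob URLs are made up for the example; the extraction step that would populate `blob_index` from your diagnostics blobs is not shown.

```python
import sqlite3

# In-memory SQLite stands in for Azure SQL Database in this sketch.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE devices (device_id TEXT PRIMARY KEY, owner TEXT);
    CREATE TABLE blob_index (blob_url TEXT PRIMARY KEY, device_id TEXT, severity TEXT);
    """
)

# Existing relational data.
db.executemany("INSERT INTO devices VALUES (?, ?)", [("dev-1", "ops"), ("dev-2", "lab")])

# Fields extracted from each diagnostics blob, stored next to the blob's URL.
db.executemany(
    "INSERT INTO blob_index VALUES (?, ?, ?)",
    [
        ("https://myaccount.blob.core.windows.net/diag/a.json", "dev-1", "error"),
        ("https://myaccount.blob.core.windows.net/diag/b.json", "dev-2", "info"),
        ("https://myaccount.blob.core.windows.net/diag/c.json", "dev-2", "error"),
    ],
)

# Join the extracted blob fields against the relational data.
error_blobs = db.execute(
    """
    SELECT b.blob_url, d.owner
    FROM blob_index AS b
    JOIN devices AS d ON d.device_id = b.device_id
    WHERE b.severity = 'error'
    ORDER BY b.blob_url
    """
).fetchall()
print(error_blobs)
```

The query returns blob URLs, so the application can still fetch the full diagnostics payload from Blob Storage when it needs more than the indexed fields.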
If you're talking about full-text search, you'd need to employ an appropriate fts tool. Azure provides Azure Search as a 1st-party full-text-search service, or you may certainly use a 3rd-party tool or service. What you choose is completely up to you.
I have some data stored on Azure Storage in compressed form, and I want to decompress it. Is it possible to decompress it without downloading it to the virtual machine? In other words, can the storage work the same way my secondary storage device does? Ask if you need more detail.
The answer is always "it depends".
Is it possible? Yes. Do you really want to do it? I am not sure.
Take the Blob Storage, because I assume you store your data in a blob storage. There are two different types of Blobs - Block Blobs and Page Blobs. Either can be updated by partially updating its content.
With a block blob, you can modify it using the Put Block operation of the Blob Service API. With a page blob, you can use the Put Page operation.
For a block blob, after modifying the content you have to send a final Put Block List request to "commit" the changes and inform the service about the new content. A page blob has no separate commit step, because a Put Page write takes effect immediately, so implement robust retry logic around your Put Page calls.
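To show what the stage-then-commit flow means for a block blob, here is a toy in-memory model in Python. It is not the Azure SDK and makes no network calls; the class `ToyBlockBlob` and the helper `block_id` are hypothetical, and the model assumes that a commit prefers a freshly staged block over a committed one with the same ID.

```python
import base64


class ToyBlockBlob:
    """In-memory imitation of a block blob's stage-then-commit behaviour."""

    def __init__(self):
        self._staged = {}     # block id -> bytes, uploaded but not yet committed
        self._committed = {}  # block id -> bytes, part of the readable blob
        self._order = []      # committed block ids, in blob order

    def put_block(self, block_id, data):
        # Staging a block does not change what readers see.
        self._staged[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: build the new blob from staged blocks, falling back to
        # already committed ones for ids that were not re-uploaded.
        new_committed = {}
        for bid in block_ids:
            if bid in self._staged:
                new_committed[bid] = self._staged[bid]
            elif bid in self._committed:
                new_committed[bid] = self._committed[bid]
            else:
                raise KeyError(f"unknown block id: {bid}")
        self._committed = new_committed
        self._order = list(block_ids)
        self._staged.clear()

    def read(self):
        return b"".join(self._committed[bid] for bid in self._order)


def block_id(n):
    """Fixed-width, base64-encoded block id."""
    return base64.b64encode(f"{n:06d}".encode()).decode()


blob = ToyBlockBlob()
ids = [block_id(i) for i in range(3)]
for bid, chunk in zip(ids, [b"AAAA", b"BBBB", b"CCCC"]):
    blob.put_block(bid, chunk)
blob.put_block_list(ids)

blob.put_block(ids[1], b"bbbb")   # re-upload only the middle block
before_commit = blob.read()       # unchanged until the block list is committed
blob.put_block_list(ids)
after_commit = blob.read()
print(before_commit, after_commit)
```

Only the middle block is re-uploaded, yet the blob changes only once the block list is committed, which is why a partial update always needs that final request.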
Although technically it is possible to manipulate the content on the blob without downloading the whole file, it really brings more complications than it solves. For example - once you modify part of the content of a file, all the checksums are now broken. Moreover - if it is a compressed file, you also have to modify the header of the file. At the end - if you know the exact structure of what you saved and you know which exact parts of it you want to modify - you can do it. But I think it will be just overengineering.