Is Azure Blob storage the right place to store many (small) communication logs? - azure

I am working with a program which connects to multiple APIs; the logs for each operation (HTML/XML/JSON) need to be stored for possible later review. Is it feasible to store each request/reply in an Azure blob? There can be hundreds of requests per second (all of which need storing), varying in size with an average of about 100 kB.
Because the logs need to be searchable (by metadata), my plan is to store each one in Azure Blob storage and put the metadata (blob location, custom application-related request and content identifiers, etc.) in an easily searchable database.

You can store logs in either Azure Table storage or Blob storage, but Microsoft itself recommends Blob storage; Azure Storage Analytics, for example, stores its log data in Blob storage.
The 'Azure Storage Table Design Guide' points out several drawbacks of using Table storage for logs and also provides details on how to use Blob storage to store them. Read the 'Log data anti-pattern' section in particular for this use case.
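As a rough sketch of that approach (assuming the Azure.Storage.Blobs v12 .NET SDK; container, identifier and variable names such as operationId and logPayload are illustrative), each request/reply could be written as its own blob, with the searchable identifiers attached as metadata and also recorded in the database together with the blob URI:
using System;
using System.Collections.Generic;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// One blob per logged operation, named so that related entries sort together by time.
var containerClient = new BlobContainerClient(connectionString, "api-logs");
string blobName = $"{DateTime.UtcNow:yyyy/MM/dd/HH}/{operationId}.json";
BlobClient blobClient = containerClient.GetBlobClient(blobName);

var options = new BlobUploadOptions
{
    HttpHeaders = new BlobHttpHeaders { ContentType = "application/json" },
    // These values travel with the blob; the same values (plus blobClient.Uri)
    // also go into the searchable database.
    Metadata = new Dictionary<string, string>
    {
        ["api"] = apiName,
        ["requestId"] = requestId
    }
};

await blobClient.UploadAsync(BinaryData.FromString(logPayload), options);
Prefixing blob names with the date keeps listing and lifecycle rules simple, while the database remains the primary search path.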

Related

Azure Blob Storage: Does Microsoft Implement Redundant Backups?

I've searched the web and contacted technical support yet no one seems to be able to give me a straight answer on whether items in Azure Blob Storage are backed up or not.
What I mean is, do I need to create a twin storage account as a "backup" and program copies of all content from one storage to another, or are the contents of a client's Blob Storage automatically redundantly backed up by Microsoft?
I know with AWS, storage is redundantly backed up via onsite drives as well as across other nodes in the cluster.
do I need to create a twin storage account as a "backup" and program copies of all content from one storage to another, or are the contents of a client's Blob Storage automatically redundantly backed up by Microsoft?
Yes, you will need to do backups manually. Azure Storage does not back up the contents of your storage account automatically.
Azure Storage does provide geo-redundant replication (provided you configure the redundancy level for your storage account as GRS or RA-GRS), but that is not a backup. Once you delete content from your primary account (location), it is automatically removed from the secondary account (geo-redundant location).
Both AWS (EBS) and Azure (Blob Storage) provide durability by replicating the data across different data centers. This replication exists to give the high availability and durability that the cloud provider guarantees.
In order to ensure that your data is durable, Azure Storage has the ability to keep (and manage) multiple copies of your data. This is called replication, or sometimes redundancy. When you set up your storage account, you select a replication type. In most cases, this setting can be modified after the storage account is set up.
For more details, refer to the replication section in the documentation.
If you need to capture changes to the storage and be able to restore previous versions (e.g. after data corruption, or for application features like restore points and backups), you need to take a snapshot manually. This is common to both AWS and Azure.
For more details on creating a snapshot of a blob in Azure, refer to the documentation.
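For illustration, taking a snapshot is a single call. A minimal sketch, assuming the Azure.Storage.Blobs v12 .NET SDK and an existing blob (names are illustrative):
using System;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

var blobClient = new BlobClient(connectionString, "backups", "important-data.json");

// Creates a read-only, point-in-time copy of the blob; the base blob remains writable.
BlobSnapshotInfo snapshot = await blobClient.CreateSnapshotAsync();

// A snapshot is addressed by the base blob URI plus a "snapshot" timestamp parameter.
Console.WriteLine($"Snapshot taken at: {snapshot.Snapshot}");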

How to Index the Blob Storage in Azure?

How do we index blob storage? Is there a .NET SDK available? If so, I am not able to find it. All I can see are the REST API calls that one has to make to create an index and indexers.
Thanks
Blob storage as such is not indexable. What you will need to do is use the Azure Search service and pull the data from blob storage into an Azure Search index. That makes the blob storage data searchable.
To pull the data from Azure Blob Storage into an Azure Search index, you will need to create a blob data source and an indexer. The indexer is responsible for fetching the data from the blobs and populating the index.
You may find this link useful for indexing blob storage using Azure Search: https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage.
There's a .NET SDK available for managing Azure Search service indexes, data sources and indexers. You can read more about it here: https://learn.microsoft.com/en-us/dotnet/api/overview/azure/search?view=azure-dotnet. Also, the Azure Search team has published some samples on GitHub which make use of this SDK. You can find them here: https://github.com/Azure-Samples/search-dotnet-getting-started.
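A minimal sketch of creating the data source and indexer with the newer Azure.Search.Documents SDK (the service endpoint, key, index and container names here are illustrative assumptions; the older Microsoft.Azure.Search SDK used in the linked samples follows the same pattern):
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var indexerClient = new SearchIndexerClient(
    new Uri("https://<your-search-service>.search.windows.net"),
    new AzureKeyCredential(adminApiKey));

// Data source pointing at the blob container whose contents should be indexed.
var dataSource = new SearchIndexerDataSourceConnection(
    "logs-blob-datasource",
    SearchIndexerDataSourceType.AzureBlob,
    storageConnectionString,
    new SearchIndexerDataContainer("api-logs"));
await indexerClient.CreateOrUpdateDataSourceConnectionAsync(dataSource);

// Indexer that reads blobs from the data source and populates an existing search index.
var indexer = new SearchIndexer(
    "logs-blob-indexer",   // indexer name
    dataSource.Name,       // data source to read from
    "logs-index");         // existing target index to populate
await indexerClient.CreateOrUpdateIndexerAsync(indexer);
The indexer can also be given a schedule so that newly uploaded blobs are picked up automatically.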
BLOB INDEX feature in preview May 2020
Announced 4 May 2020. As with other preview features, it is not recommended to deploy it in a production environment.
Worth noting:
The key Microsoft disclaimers to using a product or feature in preview status are as follows:
All previews are excluded from Microsoft SLAs and Warranties
Previews might not include customer support from Microsoft
Preview might not be brought forward into General Release status
In my opinion this will very likely make it to GA: it is clearly needed, and similar features are available on other platforms.
Note: To populate the blob index, you define key-value tag attributes on your data, either on new data during upload or on existing data already in your storage account (GPv2 storage accounts only).
Posted on 4 May, 2020
Blob Index—a managed secondary index, allowing you to store multi-dimensional object attributes to describe your data objects for Azure Blob storage—is now available in preview. Built on top of blob storage, Blob Index offers consistent reliability, availability, and performance for all your workloads. Blob Index provides native object management and filtering capabilities, which allows you to categorize and find data based on attribute tags set on the data.
Manage and find data with Blob Index
As datasets get larger, finding specific related objects in a sea of data can be difficult and frustrating. Previously, clients used the ListBlobs API to retrieve 5,000 lexicographically ordered records at a time, parse through the list, and repeat until they found the blobs they wanted. Some users also resorted to managing a separate lookup table to find specific objects. These separate tables can get out of sync, increasing cost, complexity, and frustration. Customers should not have to worry about data organization or index table management, and instead focus on building powerful applications to grow their business.
Blob Index alleviates the data management and querying problem with support for all blob types (Block Blob, Append Blob, and Page Blob). Blob Index is exposed through a familiar blob storage endpoint and APIs, allowing you to easily store and access both your data and classification indices on the same service to reduce application complexity.
To populate the blob index, you define key-value tag attributes on your data, either on new data during upload or on existing data already in your storage account. These blob index tags are stored alongside your underlying blob data. The blob indexing engine then automatically reads the new tags, indexes them, and exposes them to a user-queryable blob index. Using the Azure portal, REST APIs, or SDKs, you can then issue a FindBlobsByTags API call specifying a set of criteria. Blob storage will return a filtered result set consisting only of the blobs that meet the match criteria.
The below scenario is an example of how Blob Index works:
In a storage account container with a million blobs, a user uploads a new blob “B2” with the following blob index tags: < Status = Unprocessed, Quality = 8K, Source = RAW >.
The blob and its blob index tags are persisted to the storage account and the account indexing engine exposes the new blob index shortly after.
Later on, an encoding application wants to find all unprocessed media files that are at least 4K resolution quality. It issues a FindBlobs API call to find all blobs that match the following criteria: < Status = Unprocessed AND Quality >= 4K AND Source = RAW >.
The blob index quickly returns just blob “B2,” the sole blob out of one million blobs that matches the specified criteria. The encoding application can quickly start its processing job, saving idle compute time and money.
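For illustration, the upload-with-tags and FindBlobsByTags steps of this scenario look roughly like the following from .NET (a sketch assuming a recent Azure.Storage.Blobs v12 SDK with blob index tag support; names and the mediaContent variable are illustrative):
using System;
using System.Collections.Generic;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Upload blob "B2" with blob index tags.
var containerClient = new BlobContainerClient(connectionString, "media");
var uploadOptions = new BlobUploadOptions
{
    Tags = new Dictionary<string, string>
    {
        ["Status"] = "Unprocessed",
        ["Quality"] = "8K",
        ["Source"] = "RAW"
    }
};
await containerClient.GetBlobClient("B2").UploadAsync(BinaryData.FromString(mediaContent), uploadOptions);

// Find all blobs matching the tag filter (tag comparisons are string-based).
var serviceClient = new BlobServiceClient(connectionString);
string filter = "\"Status\" = 'Unprocessed' AND \"Quality\" >= '4K' AND \"Source\" = 'RAW'";
await foreach (TaggedBlobItem match in serviceClient.FindBlobsByTagsAsync(filter))
{
    Console.WriteLine($"Matched: {match.BlobContainerName}/{match.BlobName}");
}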
Platform feature integrations with Blob Index
Blob Index not only helps you categorize, manage, and find your blob data but also provides integrations with other Blob service features, such as Lifecycle management.
Using the new blobIndexMatch as a filter, you can move data to cooler tiers or delete data based on the tags applied to your blobs. This allows you to be more granular in your rules and only move or delete data if they match your specified criteria.
The following sample lifecycle management policy applies to block blobs in the “videofiles” container and tiers objects to archive storage after one day only if the blobs match the blob index tag of Status = ‘Processed’ and Source = ‘RAW’.
Lifecycle management rule with blobIndexMatch example.
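The policy behind that caption would look roughly like this (a sketch of the lifecycle management rule format with a blobIndexMatch filter; the rule name is illustrative):
{
  "rules": [
    {
      "enabled": true,
      "name": "archive-processed-raw-videofiles",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToArchive": { "daysAfterModificationGreaterThan": 1 }
          }
        },
        "filters": {
          "blobTypes": [ "blockBlob" ],
          "prefixMatch": [ "videofiles/" ],
          "blobIndexMatch": [
            { "name": "Status", "op": "==", "value": "Processed" },
            { "name": "Source", "op": "==", "value": "RAW" }
          ]
        }
      }
    }
  ]
}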
Lifecycle management integration with Blob Index is just the beginning. We will be adding more integrations with other blob platform features soon!
Conditional blob operations with Blob Index tags
In REST versions 2019-10-10 and higher, most Blob service APIs now support a new conditional header, x-ms-if-tags, so that the operation will only succeed if the specified blob index tag condition is met. If the condition is not met, the operation fails, leaving the blob unmodified. This Blob Index functionality can help ensure that data operations only occur on explicitly tagged blobs and can protect against inadvertent deletion or modification by multi-threaded applications.
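A brief sketch of how that looks from .NET (assuming a recent Azure.Storage.Blobs v12 SDK, which sends the x-ms-if-tags header for you; the tag expression and blob reference are illustrative):
// Delete the blob only if its index tags say it has already been processed.
// If the tag condition is not met, the service rejects the call (HTTP 412)
// and the blob is left untouched.
var conditions = new BlobRequestConditions
{
    TagConditions = "\"Status\" = 'Processed'"
};
await blobClient.DeleteAsync(DeleteSnapshotsOption.IncludeSnapshots, conditions);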
How to get started
To enroll in the Blob Index preview, submit a request to register this feature for your subscription by running the following PowerShell or CLI commands:
Register by using PowerShell
Register-AzProviderFeature -FeatureName BlobIndex -ProviderNamespace Microsoft.Storage
Register-AzResourceProvider -ProviderNamespace Microsoft.Storage
Register by using Azure CLI
az feature register --namespace Microsoft.Storage --name BlobIndex
az provider register --namespace 'Microsoft.Storage'
After your request is approved, any existing or new General-purpose v2 (GPv2) storage accounts in France Central and France South can leverage Blob Index’s capabilities. As with most previews, we recommend that this feature should not be used for production workloads until it reaches general availability.
Ref:
https://azure.microsoft.com/en-gb/blog/manage-and-find-data-with-blob-index-for-azure-storage-now-in-preview/

Limits on File Count for Azure Blob Storage

Currently, I have a large set of text files which contain (historical) raw data from various sensors. New files are received and processed every day. I'd like to move this off of an on-premises solution to the cloud.
Would Azure's Blob storage be an appropriate mechanism for this volume of small(ish) private files? or is there another Azure solution that I should be pursuing?
Relevant Data (no pun intended) & Requirements:
The data set contains millions of mostly small files, totalling nearly 400 GB. The average file size is around 50 KB, but some files can exceed 40 MB.
I need to maintain the existing data set for posterity's sake.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Let me elaborate more on David's comments.
As David mentioned, there's no limit on the number of objects (files) that you can store in Azure Blob Storage. The limit is on the size of the storage account, which is currently 500 TB. As long as you stay within this limit you will be good. Further, you can have 100 storage accounts in an Azure subscription, so essentially the amount of data you will be able to store is practically limitless.
I do want to mention one more thing though. It seems that the files uploaded to blob storage are processed once and then more or less archived. For this I suggest you take a look at Azure Cool Blob Storage. It is essentially meant for exactly this purpose: storing objects that are not frequently accessed, yet are available almost immediately when you do need them. The advantage of using Cool Blob Storage is that writes and storage are cheaper compared to Hot Blob Storage accounts; however, reads are more expensive (which makes sense considering the intended use case).
So a possible solution would be to save the files in your Hot Blob Storage account. Once the files are processed, move them to Cool Blob Storage. This Cool Blob Storage account can be in the same or a different Azure subscription.
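As a rough illustration of that flow (assuming the Azure.Storage.Blobs v12 SDK; within the same storage account a tier change is a simple metadata operation, whereas moving to a different account requires a copy):
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// After processing, demote the blob from the Hot to the Cool access tier.
var blobClient = new BlobClient(connectionString, "sensor-data", "2016/06/01/reading-001.txt");
await blobClient.SetAccessTierAsync(AccessTier.Cool);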
I'm guessing it CAN be used as a file system, but is it the right (best) tool for the job?
Yes, Azure Blob Storage can be used as a cloud file system.
The data set contains millions of mostly small files, totalling nearly 400 GB. The average file size is around 50 KB, but some files can exceed 40 MB.
As David and Gaurav Mantri mentioned, Azure Blob Storage could meet this requirement.
I need to maintain the existing data set for posterity's sake.
Data in Azure Blob Storage is durable. You can refer to the Service Level Agreement (SLA) for Storage.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
You can use Azure Functions to do the file processing work. Since it only needs to run once a day, you could add a TimerTrigger function:
// This function is executed once a day; "0 0 0 * * *" is a CRON expression
// that fires at midnight (UTC by default).
public static void TimerJob([TimerTrigger("0 0 0 * * *")] TimerInfo timerInfo)
{
    // Write the processing job here.
}
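Since the question mentions background workers reading files off a queue, a queue-triggered function is another option. A sketch assuming the standard QueueTrigger binding (queue name and message contents are illustrative):
// Runs once per queue message; the message is assumed to carry the name of the blob to process.
public static void ProcessUploadedFile(
    [QueueTrigger("incoming-files")] string blobName,
    ILogger log)
{
    log.LogInformation($"Processing blob: {blobName}");
    // Download the blob and run the processing job here.
}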
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Blobs can be downloaded or updated at any time you want.
In addition, if your data processing job is very complicated, you could also store your data in Azure Data Lake Store and do the processing using Hadoop analytic frameworks such as MapReduce or Hive. Microsoft Azure HDInsight clusters can be provisioned and configured to directly access data stored in Data Lake Store.
Here are the differences between Azure Data Lake Store and Azure Blob Storage.
Comparing Azure Data Lake Store and Azure Blob Storage

How to determine whether Azure storage is Page or Block Blob type?

I'm trying to configure an online backup to an Azure Storage account. Some of the files I am backing up are larger than 200GB, so I have to be using page Blob type storage.
I believe that, at the moment, this is the kind of storage I have configured; however, my backup of the files that are larger than 200 GB fails, stating that the "block blob maximum size is 200GB."
How can I check what kind of storage my Azure storage is configured as? And how can I ensure that, in the future, I am configuring the correct type of storage?
An Azure Storage account can contain block, append and page blobs in the same container. There is no configuration at the account or container level; the difference is that you need to use different APIs in the SDK (or different REST APIs) for the different types of blobs.
You can refer to https://msdn.microsoft.com/en-us/library/azure/dd135733.aspx for more info.
And regarding your requirement: for those blobs that will be larger than 200 GB, you can divide them into several block blobs, and you can customize the MIME type (or metadata) of the pieces to record which file each piece belongs to.
If you have any further concerns, please feel free to let me know.
It depends on how you upload the files to Azure Storage: you specify what type of blob you want to create, either a page, block or append blob.
Ex:
// Get a reference to a page blob and create it with a fixed size.
// A page blob's total size must be a multiple of 512 bytes.
CloudPageBlob blob = container.GetPageBlobReference("file name");
blob.Properties.ContentType = "binary/octet-stream";
blob.Create(size);
Then you have to divide your stream into pages, iterate over them, and upload each one to the blob.
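As a rough sketch of that loop (continuing from the blob created above and assuming the classic Microsoft.WindowsAzure.Storage SDK, a local file path, and that the blob was created with a size at least as large as the padded file; chunk size and error handling are simplified):
// Upload a local file to the page blob in 4 MB chunks.
// Every WritePages call must start on a 512-byte boundary and contain a multiple of 512 bytes.
const int chunkSize = 4 * 1024 * 1024;
byte[] buffer = new byte[chunkSize];
long offset = 0;

using (FileStream fileStream = File.OpenRead(localFilePath))
{
    int bytesRead;
    while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Zero-pad the final chunk up to the next 512-byte boundary.
        int alignedLength = (int)(Math.Ceiling(bytesRead / 512.0) * 512);
        Array.Clear(buffer, bytesRead, alignedLength - bytesRead);

        using (var chunk = new MemoryStream(buffer, 0, alignedLength))
        {
            blob.WritePages(chunk, offset);
        }
        offset += alignedLength;
    }
}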

Query blobs in Blob storage

I have serialized text data that is stored in a blob inside Azure blob storage. The text is basically key/value data. I am wondering if there is a way to easily query the blob without exploding the data into another table/database or pulling the blob into memory?
Azure Blob storage has no API to query data within the blob - it's just dumb storage. See here for the Blob Storage API. You're essentially stuck reading, deserializing and grabbing your value(s).
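If you do go the read-and-parse route, a minimal sketch (assuming the Azure.Storage.Blobs v12 SDK and that the blob contains one key=value pair per line; names are illustrative):
using System;
using System.Linq;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Download the blob's content and parse it into a key/value dictionary.
var blobClient = new BlobClient(connectionString, "logs", "entry-123.txt");
BlobDownloadResult download = blobClient.DownloadContent();
string text = download.Content.ToString();

var values = text
    .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
    .Select(line => line.Split(new[] { '=' }, 2))
    .Where(parts => parts.Length == 2)
    .ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());

Console.WriteLine(values["RequestId"]);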
Perhaps Azure table storage would be a better fit for this application? That at least keeps things in the realm of an Azure storage account rather than needing to pull in a SQL Server instance.
One option you could consider is Azure Data Lake Analytics, as it supports Azure Blob storage as a data source.
Depending on what your preferred way of accessing the data is, you can use PowerShell, .NET SDK etc. to query the data...

Resources