How are BLOBs stored in Azure?

I did some research to compare file storage in SharePoint and file storage in Azure.
As far as I know, SP uses a SQL database to store everything. So in fact, when I put a BLOB into SP, it ends up in SQL, as mentioned here. So there are some disadvantages of storing BLOBs in SP:
Write operations are particularly problematic because the BLOB is written twice: first to the transaction log for transactional consistency, then to the appropriate table in the SQL content database.
Boiling down a lot of data, it's pretty clear that files greater than 1 MB perform better (reads and writes) when the BLOB is externalized.
Now I wonder whether Azure Blob Storage has the same disadvantages: do the BLOBs also end up inside a database, and do the same drawbacks apply?

In short, the answer is no. The BLOB is not stored in a SQL database in Azure Storage.
The paper below gives more insight into the internals of Azure Storage. Do read it:
http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf

Related

Use Data Lake or Blob on HDInsights cluster on Azure

When creating an HDInsight Hadoop cluster in Azure, there are two storage options: Azure Data Lake Store (ADLS) or Azure Blob Storage.
What are the real differences between these two options, and how do they affect performance?
I found this page https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-comparison-with-blob-storage
But it is not very specific; it only uses very general terms like "ADLS is optimized for analytics".
Does that mean it is better for storing the HDInsight file system? And if ADLS is indeed faster, then why not use it for non-analytics data as well?
As per this document, an Azure Storage account can hold up to 4.75 TB, though individual blobs (or files from an HDInsight perspective) can only go up to 195 GB. Azure Data Lake Store can grow dynamically to hold trillions of files, with individual files greater than a petabyte. For more information, see Understanding blobs and Data Lake Store.
Also, check Benefits of Azure Storage and Use Data Lake Store for more details and comparisons.
Hope this helps.
In addition to Ashok's answer: ADLS is currently only available in a few regions, compared to Azure Storage. So if you need your HDInsight account in a specific region, you should make sure your storage is in the same region.
Another benefit of ADLS over Azure Storage is its POSIX-based security model at the file/folder level that uses AAD security principals instead of Shared Access Keys.
The reason why you may not want to use ADLS for non-analytics data is primarily cost. Because of some of the additional capabilities, it is currently a bit more expensive.
In addition to the other answers: it's not possible to use the Spark activity of Data Factory on HDInsight clusters that use Data Lake Store as the primary storage. This limitation applies to both ADF v1 and v2, as seen here: https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-spark and https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-spark

Limits on File Count for Azure Blob Storage

Currently, I have a large set of text files which contain (historical) raw data from various sensors. New files are received and processed every day. I'd like to move this off of an on-premises solution to the cloud.
Would Azure's Blob storage be an appropriate mechanism for this volume of small(ish) private files? Or is there another Azure solution that I should be pursuing?
Relevant Data (no pun intended) & Requirements:
The data set contains millions of mostly small files, totaling nearly 400 GB. The average file size is around 50 KB, but some files could exceed 40 MB.
I need to maintain the existing data set for posterity's sake.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Let me elaborate more on David's comments.
As David mentioned, there's no limit on the number of objects (files) that you can store in Azure Blob Storage. The limit is on the size of the storage account, which is currently 500 TB. As long as you stay within this limit you will be good. Further, you can have 100 storage accounts in an Azure subscription, so essentially the amount of data you will be able to store is practically limitless.
I do want to mention one more thing though. It seems that the files uploaded to blob storage are processed once and then essentially archived. For this I suggest you take a look at Azure Cool Blob Storage. It is meant for exactly this purpose: storing objects that are not frequently accessed, yet are available almost immediately when you do need them. The advantage of Cool Blob Storage is that writes and storage are cheaper than in Hot Blob Storage accounts; however, reads are more expensive (which makes sense considering the intended use case).
So a possible solution would be to save the files in your Hot Blob Storage account and, once the files are processed, move them to Cool Blob Storage. This Cool Blob Storage account can be in the same or a different Azure subscription.
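For illustration, a minimal sketch of that hand-off with the Azure.Storage.Blobs .NET SDK (connection strings, container and blob names are placeholders): copy a processed blob from the hot account to a cool account, then delete the original.

using System.Threading.Tasks;
using Azure.Storage.Blobs;

// Sketch only: archive a processed blob by copying it from the "hot" account to a "cool" account.
static async Task ArchiveProcessedBlobAsync(string blobName)
{
    var hotBlob  = new BlobClient("<hot-account-connection-string>", "incoming", blobName);
    var coolBlob = new BlobClient("<cool-account-connection-string>", "archive", blobName);

    // Server-side copy; the source must be readable (append a SAS to hotBlob.Uri if the container is private).
    var copy = await coolBlob.StartCopyFromUriAsync(hotBlob.Uri);
    await copy.WaitForCompletionAsync();
    await hotBlob.DeleteAsync();
}

(With newer general-purpose v2 accounts, an alternative is to keep a single account and simply change the blob's access tier to Cool instead of copying between accounts.)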
I'm guessing it CAN be used as a file system, but is it the right (best) tool for the job?
Yes, Azure Blob Storage can be used as a cloud file system.
The data set contains millions of mostly small files, totaling nearly 400 GB. The average file size is around 50 KB, but some files could exceed 40 MB.
As David and Gaurav Mantri mentioned, Azure Blob Storage could meet this requirement.
I need to maintain the existing data set for posterity's sake.
Data in Azure Blob Storage is durable. You can refer to the Service Level Agreement (SLA) for Storage.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
You can use Azure Functions to do the file-processing work. Since it runs once a day, you could add a TimerTrigger function.
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

// This function will be executed once a day at midnight (CRON format: {second} {minute} {hour} {day} {month} {day-of-week})
public static void TimerJob([TimerTrigger("0 0 0 * * *")] TimerInfo timerInfo, ILogger log)
{
    // Write the processing job here, e.g. list the day's new blobs and process each one.
    log.LogInformation("Daily processing started");
}
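Since your processing is driven by background workers reading files off a queue, a QueueTrigger function may fit even better. A rough sketch, assuming a queue named incoming-files whose messages carry the blob name, and a container named rawdata (both names are made up for illustration):

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

// Rough sketch of the queue-driven alternative: each queue message carries the name of an uploaded blob.
// "incoming-files" (queue) and "rawdata" (container) are assumed names.
public static void ProcessFile(
    [QueueTrigger("incoming-files")] string blobName,
    [Blob("rawdata/{queueTrigger}", FileAccess.Read)] Stream blobContent,
    ILogger log)
{
    log.LogInformation($"Processing {blobName}");
    // Read blobContent and run the per-file processing here.
}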
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Blobs can be downloaded or updated at any time you want.
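For example, re-downloading a file for review could be as simple as the following sketch with the Azure.Storage.Blobs .NET SDK (connection string, container, blob and local path are placeholders):

using Azure.Storage.Blobs;

// Placeholder names: download a previously processed file for local review.
var blob = new BlobClient("<connection-string>", "rawdata", "sensor-2016-01-01.txt");
blob.DownloadTo(@"C:\review\sensor-2016-01-01.txt");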
In addition, if your data processing job is very complicated, you could also store your data in Azure Data Lake Store and do the processing using Hadoop analytic frameworks such as MapReduce or Hive. Microsoft Azure HDInsight clusters can be provisioned and configured to directly access data stored in Data Lake Store.
Here are the differences between Azure Data Lake Store and Azure Blob Storage.
Comparing Azure Data Lake Store and Azure Blob Storage

Is Azure Blob storage the right place to store many (small) communication logs?

I am working with a program which connects to multiple APIs; the logs for each operation (HTML/XML/JSON) need to be stored for possible later review. Is it feasible to store each request/reply in an Azure blob? There can be hundreds of requests per second (all of which need storing), varying in size with an average of around 100 KB.
Because the logs need to be searchable (by metadata), my plan is to store them in Azure Blob storage and put the metadata (blob locations, custom application-related request and content identifiers, etc.) in an easily searchable database.
You can store logs in either Azure Table storage or Blob storage, but Microsoft itself recommends using Blob storage; Azure Storage Analytics stores its log data in Blob storage.
The 'Azure Storage Table Design Guide' points out several drawbacks of using Table storage for logs and also provides details on how to use Blob storage to store logs. Read the 'Log data anti-pattern' section in particular for this use case.
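A minimal sketch of that pattern with the Azure.Storage.Blobs .NET SDK (the container name, blob naming scheme and log format are made up for illustration): append request/reply log lines to one append blob per hour, and keep the searchable metadata (blob path, request identifiers, etc.) in your database.

using System;
using System.IO;
using System.Text;
using Azure.Storage.Blobs.Specialized;

// Sketch only: one append blob per hour; "api-logs" is an assumed container name.
var appendBlob = new AppendBlobClient(
    "<connection-string>", "api-logs", $"{DateTime.UtcNow:yyyy/MM/dd/HH}.log");
appendBlob.CreateIfNotExists();

// Each request/reply pair becomes one line; the blob path and request id go into the metadata database.
var line = Encoding.UTF8.GetBytes("{\"requestId\":\"abc\",\"status\":200}\n");
using var stream = new MemoryStream(line);
appendBlob.AppendBlock(stream);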

Query blobs in Blob storage

I have serialized text data stored in a blob inside Azure Blob storage. The text is basically key/value data. I am wondering if there is a way to easily query the blob without exploding the data into another table/database or pulling the whole blob into memory.
Azure Blob storage has no API to query data within the blob - it's just dumb storage. See here for the Blob Storage API. You're essentially stuck reading, deserializing and grabbing your value(s).
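If you do go the read-and-deserialize route, here is a small sketch, assuming the blob holds a flat JSON object of key/value pairs (connection string and names are placeholders):

using System.Collections.Generic;
using System.Text.Json;
using Azure.Storage.Blobs;

// Sketch: pull the whole blob down, deserialize it, and pick out the key you need.
var blob = new BlobClient("<connection-string>", "data", "settings.json");
string json = blob.DownloadContent().Value.Content.ToString();

var pairs = JsonSerializer.Deserialize<Dictionary<string, string>>(json);
string value = pairs["someKey"];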
Perhaps Azure table storage would be a better fit for this application? That at least keeps things in the realm of an Azure storage account rather than needing to pull in a SQL Server instance.
One option you could consider is to use Data Lake Analytics, as it supports Azure Blobs as a data source.
Depending on what your preferred way of accessing the data is, you can use PowerShell, .NET SDK etc. to query the data...

Can we use HDInsight Service for ATS?

We have a logging system called Xtrace. We use this system to dump logs, exceptions, traces, etc. into a SQL Azure database. The ops team then uses this data for debugging and SCOM purposes. Considering the 150 GB limitation that SQL Azure has, we are thinking of using the HDInsight (Big Data) Service.
If we dump the data in Azure Table Storage, will HDInsight Service work against ATS?
Or will it work only against blob storage, which would mean the log records need to be created as files in blob storage?
Last question: considering the scenario I explained above, is this a good candidate for using the HDInsight Service?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure") and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150 GB limit you mentioned in your question.
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (but I think writing directly to text files will be more efficient, skipping the need for doing a bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store initially to Table Storage, then extracting the specific data you need whenever launching your HDInsight jobs.
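For what it's worth, a rough sketch of such a converter using the current Azure.Data.Tables and Azure.Storage.Blobs .NET SDKs (table, container, partition and property names are all made up for illustration): read the log entities, flatten each one to a line of text, and upload the result as a blob for the HDInsight job to consume.

using System.IO;
using System.Text;
using Azure.Data.Tables;
using Azure.Storage.Blobs;

// Rough sketch (placeholder names): flatten Table Storage log entities into a text file in blob storage.
var table = new TableClient("<connection-string>", "XtraceLogs");
var sb = new StringBuilder();

foreach (TableEntity entity in table.Query<TableEntity>(filter: "PartitionKey eq '2013-01-01'"))
{
    sb.AppendLine($"{entity.PartitionKey}\t{entity.RowKey}\t{entity.GetString("Message")}");
}

var blob = new BlobClient("<connection-string>", "hdinsight-input", "logs/2013-01-01.txt");
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(sb.ToString()));
blob.Upload(stream, overwrite: true);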
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
The answer above is slightly misleading with regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead, a standard Hadoop InputFormat or Hive StorageHandler can be written that reads directly from ATS. There are at least two implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball
