What is the benefit of having a linked storage account for an HDInsight cluster?

An HDInsight cluster must have at least one Azure storage account as its default storage account; that is required because it serves as the cluster's filesystem. This I get. But what about the optional linked Azure storage accounts? From an ADF (Azure Data Factory) perspective at least, do we need to add a storage account as a linked storage account on an HDInsight cluster? After all, an Azure storage account is accessible purely by providing two pieces of information: the account name and the key. Both are specified in linked services in ADF, which already guarantees access to the storage account. So what is the real benefit of adding an account as a linked storage account, from an ADF point of view or otherwise? Basically, what I am asking is: is there anything we can't do purely with the account name and key, without adding the account as linked storage for the given HDInsight cluster?

The main reason to have additional accounts is that each account has limits. A storage account can hold 500 TB of data and handle 20,000 requests per second. Depending on the size of your cluster and your workload, you might hit the request limit. If you are worried about those limits and don't want to manage a lot of storage accounts, you should look into Azure Data Lake.

I think I have sort of figured out the answer. With linked storage accounts, the cluster, when used as compute, can directly access blobs on those storage accounts without us having to separately specify the storage keys in queries. That's the use case for which linked storage is a must-have.
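For contrast, here is a minimal sketch (PySpark, with a hypothetical account, container and key as placeholders) of what accessing a non-linked storage account from the cluster typically looks like: the job has to hand the key to the cluster itself before the wasbs:// path will resolve. With a linked storage account, the key is already in the cluster configuration, so the conf.set() step is unnecessary.

# Sketch: reading a blob path from a storage account that is NOT linked to the
# cluster. "mystorageacct", "mycontainer" and the key are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Without a linked account, the job has to supply the key itself:
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    "<storage-account-key>")

# With a linked account, this read would work without the conf.set() above.
df = spark.read.csv(
    "wasbs://mycontainer@mystorageacct.blob.core.windows.net/data/input.csv")
df.show()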

Related

Terraform State File in storage account

We have our Terraform state file stored in an Azure storage account. If the storage account went down we would be screwed. What is the best way to store the file, and where?
AFAIK, there are two ways to store a Terraform state file: locally on your machine, or in a storage account in Azure.
If the storage account went down we would be screwed. What is the best way to store the file, and where?
As confirmed, you are using Standard_LRS, which per the Microsoft documentation is not recommended if you are looking for high availability.
Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the primary region. LRS is the least expensive replication option, but is not recommended for applications requiring high availability or durability.
So, as a solution, you can change the storage account replication type to Standard_GRS or Standard_ZRS as per your requirement, so that your data is replicated to more than one location.
You can change it in the portal by going to your storage account > Configuration > Replication, or script the change as sketched below.
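A minimal sketch using the Python management SDK (azure-mgmt-storage with azure-identity); the subscription ID, resource group and account name are placeholders, and this is only one way to make the change:

# Sketch: switching an existing account from LRS to GRS programmatically.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountUpdateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Change the replication type so the state file is copied to the paired region.
client.storage_accounts.update(
    "my-resource-group",
    "mytfstateaccount",
    StorageAccountUpdateParameters(sku=Sku(name="Standard_GRS")),
)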
If you want more details on disaster recovery (when one location is down) or on protecting data from accidental deletes, please refer to the documents below:
Disaster recovery and storage account failover - Azure Storage | Microsoft Docs
Soft delete for containers - Azure Storage | Microsoft Docs

How to increase availability for Azure storage account?

For Azure storage accounts, the SLA for write requests is 99.9% regardless of whether I'm using LRS, ZRS, GRS or RA-GRS. Is there a way to increase the SLA for write requests on the storage account?
E.g., is there a good way to fail over to another storage account in another region?
The accounts don't have to contain the same data. I just want to always be able to store the blobs.
Is there a good way to fail over to another storage account in another region?
Of course. Azure Storage itself provides this feature; you can refer to the document Initiate a storage account failover.
Before you can perform an account failover on your storage account, make sure that your storage account is configured for geo-replication. Your storage account can use any of the following redundancy options:
Geo-redundant storage (GRS) or read-access geo-redundant storage (RA-GRS)
Geo-zone-redundant storage (GZRS) or read-access geo-zone-redundant storage (RA-GZRS)
In the portal, you can initiate the failover from the storage account once geo-replication is configured; it can also be done programmatically.
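Here is a rough sketch with the Python management SDK (azure-mgmt-storage); the names are placeholders and the account must already be GRS/RA-GRS or GZRS/RA-GZRS:

# Sketch: initiating an account failover to the secondary region.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# This is a long-running operation; the account becomes LRS in the former
# secondary region once it completes.
poller = client.storage_accounts.begin_failover("my-resource-group", "mystorageaccount")
poller.wait()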
Is there a way to increase the SLA for write requests on the storage account?
My suggestion is to increase the number of retries for write requests; the example here may help you, using BlobClientOptions in the .NET SDK.
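If you are not using .NET, the same idea applies elsewhere; for instance, a minimal sketch with the Python SDK (azure-storage-blob), where retry behaviour is passed as keyword arguments to the client (the account URL and credential are placeholders):

# Sketch: allowing more retry attempts for requests against the account.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential="<account-key>",
    retry_total=10,  # allow more retry attempts per request than the default
)

Note that retries only ride out transient failures within the same account; they do not change the published SLA itself.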

Azure blob container backup and recovery

I am thinking of using Azure Blob Storage for a document management system I am developing. All blobs (images, videos, Word/Excel/PDF files, etc.) will be stored in Azure Blob Storage. As I understand it, I need to create a container, and these files can be stored within the container.
I would like to know how to safeguard against accidental/malicious deletion of the container. If a container is deleted, all the files it contains will be lost. I am trying to figure out how to put a backup and recovery mechanism in place for my storage account so that it is guaranteed that if something happens to a container, I can recover the files inside it.
Is there any mechanism provided by Microsoft Azure for such backup and recovery, or do I need to explicitly write code so that files are stored in two separate blob storage accounts?
Anyone with access to your storage account's key (primary or secondary; there are two keys for a storage account) can manipulate the storage account in any way they see fit. The only way to ensure nothing happens? Don't give anyone access to the key(s). If you place the storage account within a resource group that only you have permissions on, you'll at least prevent others with access to the subscription from discovering the storage account and accessing it.
Within the subscription itself, you can place a lock on the actual resource (the storage account), so that nobody with access to the subscription accidentally deletes the entire storage account.
Note: with storage account keys, you do have the ability to regenerate the keys at any time. So if you ever suspected a key was compromised, you can perform a re-gen action.
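For the lock mentioned above, a rough sketch using the Python management SDK (azure-mgmt-resource) might look like this; the subscription ID, resource group, storage account name and lock name are placeholders, and exact model/method names may differ between SDK versions:

# Sketch: putting a CanNotDelete lock on the storage account so nobody with
# subscription access can delete it by accident.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ManagementLockClient
from azure.mgmt.resource.locks.models import ManagementLockObject

lock_client = ManagementLockClient(DefaultAzureCredential(), "<subscription-id>")

scope = (
    "/subscriptions/<subscription-id>"
    "/resourceGroups/my-resource-group"
    "/providers/Microsoft.Storage/storageAccounts/mystorageaccount"
)

lock_client.management_locks.create_or_update_by_scope(
    scope,
    "do-not-delete",
    ManagementLockObject(level="CanNotDelete", notes="Protect document blobs"),
)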
Backups
There are several backup solutions offered for blob storage in case containers get deleted. More product info can be found here: https://azure.microsoft.com/en-us/services/backup/
Redundancy
If you are concerned about availability: "The data in your Microsoft Azure storage account is always replicated to ensure durability and high availability. Replication copies your data, either within the same data center, or to a second data center, depending on which replication option you choose." There are several replication options:
Locally redundant storage (LRS)
Zone-redundant storage (ZRS)
Geo-redundant storage (GRS)
Read-access geo-redundant storage (RA-GRS)
More details can be found here:
https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy
Managing Access
Finally, managing access to your storage account is the best way to secure it and ensure you avoid any loss of data. If you don't want anyone to delete files, folders, etc., you can grant read-only access through SAS (Shared Access Signatures), which let you create policies and grant access based on Read, Write, List, Delete and so on. A quick GIF demo can be seen here: https://azure.microsoft.com/en-us/updates/manage-stored-access-policies-for-storage-accounts-from-within-the-azure-portal/
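As a sketch of that idea with the Python SDK (azure-storage-blob); the account, container and key below are placeholders, and the SAS grants read and list permissions only, so holders cannot delete anything:

# Sketch: issuing a read/list-only SAS URL for a container.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

sas_token = generate_container_sas(
    account_name="mystorageaccount",
    container_name="documents",
    account_key="<account-key>",
    permission=ContainerSasPermissions(read=True, list=True),  # no write/delete
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

url = f"https://mystorageaccount.blob.core.windows.net/documents?{sas_token}"
print(url)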
We are using blob storage to store documents and for document management.
To prevent deletion of blobs, you can now enable soft delete as described here:
https://azure.microsoft.com/en-us/blog/soft-delete-for-azure-storage-blobs-ga/
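Here is a minimal sketch of enabling soft delete with the Python SDK (azure-storage-blob); the connection string and retention period are placeholders:

# Sketch: turning on blob soft delete so deleted blobs stay recoverable.
from azure.storage.blob import BlobServiceClient, RetentionPolicy

service = BlobServiceClient.from_connection_string("<connection-string>")

# Keep deleted blobs recoverable for 14 days.
service.set_service_properties(
    delete_retention_policy=RetentionPolicy(enabled=True, days=14)
)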
You can also create your own automation around PowerShell and AzCopy to do incremental and full backups.
The last element would be to use RA-GRS, which lets you read your blobs from a read-only secondary copy in another region in case the primary data center goes down.
Designing Highly Available Applications using RA-GRS
https://learn.microsoft.com/en-us/azure/storage/common/storage-designing-ha-apps-with-ragrs?toc=%2fazure%2fstorage%2fqueues%2ftoc.json
Use Microsoft's Azure Storage Explorer. It will allow you to download the full contents of blob containers including folders and subfolders with blobs. Conversely, you can upload to containers in the same way. Simple and free!

Azure Blob Storage: Does Microsoft Implement Redundant Backups?

I've searched the web and contacted technical support yet no one seems to be able to give me a straight answer on whether items in Azure Blob Storage are backed up or not.
What I mean is, do I need to create a twin storage account as a "backup" and program copies of all content from one storage to another, or are the contents of a client's Blob Storage automatically redundantly backed up by Microsoft?
I know with AWS, storage is redundantly backed up via onsite drives as well as across other nodes in the cluster.
do I need to create a twin storage account as a "backup" and program copies of all content from one storage to another, or are the contents of a client's Blob Storage automatically redundantly backed up by Microsoft?
Yes, you will need to do backups manually. Azure Storage does not back up the contents of your storage account automatically.
Azure Storage does provide geo-redundant replication (provided you configure the redundancy level for your storage account as GRS or RA-GRS), but that is not backup. Once you delete content from your primary account (location), it is automatically removed from the secondary account (geo-redundant location) as well.
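If you do roll your own backup, a bare-bones sketch with the Python SDK (azure-storage-blob) might look like the following; both connection strings and the container name are placeholders, and this copies current blob versions only:

# Sketch: server-side copy of every blob in a container to a second account.
from azure.storage.blob import BlobServiceClient

source = BlobServiceClient.from_connection_string("<source-connection-string>")
backup = BlobServiceClient.from_connection_string("<backup-connection-string>")

src_container = source.get_container_client("documents")
dst_container = backup.get_container_client("documents")

for blob in src_container.list_blobs():
    src_url = src_container.get_blob_client(blob.name).url
    # For a private source container, append a read SAS to src_url first.
    dst_container.get_blob_client(blob.name).start_copy_from_url(src_url)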
Both AWS (EBS) and Azure (Blob Storage) provide durability by replicating data across different data centers. This is how the cloud provider delivers the high availability and durability it guarantees for your data.
In order to ensure that your data is durable, Azure Storage has the ability to keep (and manage) multiple copies of your data. This is called replication, or sometimes redundancy. When you set up your storage account, you select a replication type. In most cases, this setting can be modified after the storage account is set up.
For more details, refer to the replication section in the documentation.
If you need to capture changes to the storage and be able to restore previous versions (e.g. in situations like data corruption, or application requirements such as restore points and backups), you need to take a snapshot manually. This is common to both AWS and Azure.
For more details on creating a snapshot of a blob in Azure, refer to the documentation.
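As a small sketch of taking one such snapshot with the Python SDK (azure-storage-blob); the connection string, container and blob name are placeholders:

# Sketch: creating a point-in-time snapshot of a single blob.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="documents", blob="report.pdf")

# Snapshots are read-only copies of the blob at this moment in time.
snapshot = blob.create_snapshot()
print(snapshot["snapshot"])  # the snapshot's timestamp identifier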

Azure - Multiple Cloud Services, Single Storage Account

I want to create a couple of cloud services: Int, QA, and Prod. Each of these will connect to a separate DB.
Do these cloud services require "storage accounts"? Conceptually the cloud services have executables and they must be physically located somewhere.
Note: I do not use any blobs/queues/tables.
If so, must I create 3 separate storage accounts or link them up to one?
Storage accounts are more like storage namespaces: each has a URL and a set of access keys. You can use storage from anywhere, whether from the cloud or not, from one cloud service or many.
As @sharptooth pointed out, you need storage for diagnostics with Cloud Services, as well as for attached disks (Azure Drives for cloud services) and for the deployments themselves (storing the cloud service package and configuration).
Storage accounts are free: that is, you can create a bunch and still only pay for consumption.
There are some objective reasons why you'd go with separate storage accounts:
You feel that you could exceed the advertised 20,000 transactions/second limit of a single storage account (remember that storage diagnostics consume some of this transaction rate, depending on how aggressive your logging is).
You are concerned about security/isolation. You may want your dev and QA folks using an entirely different subscription altogether, with their own storage accounts, to avoid any risk of damaging a production deployment.
You feel that you'll exceed 500 TB (the capacity limit of a single storage account).
Azure Diagnostics uses Azure Table Storage under the hood (and it's more convenient to use one storage account for every service, but it's not required). Other dependencies your service has might also use some of the Azure Storage services. If you're sure that you don't need Azure Storage (and so you don't need persistent storage of the data dumped through Azure Diagnostics), then you can go without it.
The service package of your service will be stored and managed by Azure infrastructure - that part doesn't require a storage account.
