I have found quite a few answers about copying blobs between Azure storage accounts. I know of the Start-AzureStorageBlobCopy cmdlet. However, I have more than 20 million files to copy between two storage accounts in the same data center, and it seems to take forever (it has been copying for more than a week), since it starts each file copy separately.
Furthermore, I found that in the most current version of the Azure tools (7.4), the cmdlet downloads the full file list into memory and only then starts the copy process. So it not only takes forever but also uses a large amount of memory. The same is true if I use AzCopy.
Thus my question: what is a good way (that actually works!) to copy a large number of files, each of which is not that big, between two storage accounts in the same data center? Or maybe you know of parameters to set when using the cmdlets (the documentation is awful and not up to date)?
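One way around the in-memory listing is to page through the source container and schedule server-side copies page by page. The following is only a hedged sketch using the current Az.Storage cmdlet names (Start-AzStorageBlobCopy is the Az-module successor of Start-AzureStorageBlobCopy); the account, key, and container variables are placeholders:

    # Sketch: page through the source container (5,000 blobs at a time) and start
    # asynchronous server-side copies, so the full 20-million-blob listing never
    # has to sit in memory. Account, key, and container names are placeholders.
    $srcCtx = New-AzStorageContext -StorageAccountName $srcAccount -StorageAccountKey $srcKey
    $dstCtx = New-AzStorageContext -StorageAccountName $dstAccount -StorageAccountKey $dstKey
    $token = $null
    do {
        $page = @(Get-AzStorageBlob -Container $srcContainer -Context $srcCtx `
                    -MaxCount 5000 -ContinuationToken $token)
        foreach ($blob in $page) {
            Start-AzStorageBlobCopy -SrcContainer $srcContainer -SrcBlob $blob.Name `
                -Context $srcCtx -DestContainer $dstContainer -DestBlob $blob.Name `
                -DestContext $dstCtx -Force | Out-Null
        }
        $token = if ($page.Count -gt 0) { $page[-1].ContinuationToken } else { $null }
    } while ($null -ne $token)

Start-AzStorageBlobCopy only schedules the copy; the storage service performs it in the background, and Get-AzStorageBlobCopyState can be used afterwards to spot-check progress.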
Related
The storage consumption on our ADLS Gen2 account rose from 5 TB to 314 TB within 10 days and has remained steady since then. It has just two containers: the $logs container and a container with all the directories for data storage. The $logs container looks empty. I have tried looking at Folder Statistics in Azure Storage Explorer on the other container, and none of the directories seems big enough.
Interestingly, Folder Statistics on one of the directories ran for a few hours, so I cancelled it. On cancellation, the partial result showed 200+ TB and 88k+ blobs in it. I did a visual inspection of the directory and there were just a handful of blobs that would barely sum to 1 GB. This directory had been present for months without issue. Regardless, I deleted the directory and checked the storage consumption after a few hours, but could not see any change.
This brings up the following questions:
If I cancel an ongoing Folder Statistics run, could it show an incorrect partial result (in the above case it showed 200+ TB, whereas the directory looked like barely 1 GB in reality)? I have cancelled it on previous occasions and even those partial stats seemed plausible.
Could there be hidden blobs in ADLS Gen2 that might not show up on visual inspection? (I have Read, Write, Delete access if that matters)
I have run Folder Statistics in Azure Storage Explorer for all folders individually. But is there a better way to get the storage consumption in one go (at least broken down to directory and sub-directory level; I suppose blob level would be overkill, but whatever works)? I have access to Databricks with a mount point to this container and can create a cluster with the required runtime if such code is specific to one.
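If the Databricks route is optional, a hedged PowerShell alternative is to sum blob sizes per top-level directory through the blob API (which also works against ADLS Gen2 accounts). This assumes the Az.Storage module; the account name, key, and container name are placeholders:

    # Sketch: sum blob sizes per top-level directory, paging 5,000 blobs at a time.
    # If soft delete is enabled, adding -IncludeDeleted to Get-AzStorageBlob also
    # counts soft-deleted blobs, which do not show up in a normal listing.
    $ctx   = New-AzStorageContext -StorageAccountName $accountName -StorageAccountKey $accountKey
    $sizes = @{}
    $token = $null
    do {
        $page = @(Get-AzStorageBlob -Container $containerName -Context $ctx `
                    -MaxCount 5000 -ContinuationToken $token)
        foreach ($blob in $page) {
            $topDir = ($blob.Name -split '/')[0]
            $sizes[$topDir] += $blob.Length
        }
        $token = if ($page.Count -gt 0) { $page[-1].ContinuationToken } else { $null }
    } while ($null -ne $token)
    $sizes.GetEnumerator() | Sort-Object Value -Descending |
        ForEach-Object { '{0,18:N0} bytes  {1}' -f $_.Value, $_.Key }

This also bears on question 2: soft-deleted blobs, snapshots, and old versions do not show up in a normal visual inspection but still count toward capacity.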
Update: We found the cause of the increase. It was, in fact, a few copy activities created by our team.
Interestingly, when we deleted that data, it took about 48 hours before the storage graph actually started going down, although the files disappeared immediately.
This was not a delay in refreshing the consumption graph; it actually took that long before we saw the expected sharp dip in storage.
We raised a Microsoft case and they confirmed that such an amount of data can take time to actually delete in the background.
I am trying to copy millions of small CSV files from my storage account to a physical machine using the azcopy command, and I have noticed that the speed is very slow.
The format of the azcopy command is:
Azcopy Copy <Storage_account_source> --recursive --overwrite=True
And the command is run from the physical machine.
Is there a way to make azcopy download multiple blobs concurrently, instead of checking the blobs one by one? I believe that's why the speed drops to such a low value of 1 MB/second: it is doing checks on these really small blobs one by one. Or is there another way to increase the speed for this kind of blob transfer?
azcopy is highly optimized for throughput, using parallel processing etc. I haven't come across any tool that provides faster download speeds overall. In my experience the main limiting factors are usually (obviously) network bandwidth, but also CPU power; it uses a lot of compute resources. So can you maybe increase those two on your machine, at least for the duration of the download?
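One specific knob on top of that: AzCopy v10 reads its degree of parallelism from the AZCOPY_CONCURRENCY_VALUE environment variable, so for millions of tiny blobs it can be worth raising it explicitly. A sketch (the URL, SAS token, and destination path are placeholders):

    # Raise AzCopy's transfer parallelism before downloading many small blobs.
    # AZCOPY_CONCURRENCY_VALUE is a documented AzCopy v10 setting; by default it
    # scales with the number of CPU cores. URL, SAS, and target path are placeholders.
    $env:AZCOPY_CONCURRENCY_VALUE = '512'
    azcopy copy "https://<account>.blob.core.windows.net/<container>?<SAS>" "D:\data" --recursive --overwrite=true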
We have an application which is quite scalable as it is. Basically you have one or more stateless nodes that all do independent work on files that are read from and written to a shared NFS share.
This NFS share can be a bottleneck, but with on-premises deployments customers just buy a big enough box to get sufficient performance.
Now we are moving this to Azure and I would like a better, more "cloudy" way of sharing data :) Running our own Linux NFS servers isn't an ideal scenario if we have to manage them.
Is the Azure Blob storage the right tool for this job (https://azure.microsoft.com/en-us/services/storage/blobs/)?
we need good scalability, e.g. up to 10k files written in a minute
files are quite small, less than 50KB per file on average
files created and read, not changed
files are short lived, we purge them every day
I am looking for more practical experience with this kind of storage and how good it really is.
There are two possible solutions for your scenario: Azure Storage Blobs (recommended in your case) or Azure Files.
Azure Blob storage's published scaling targets comfortably cover your numbers, but keep two caveats in mind:
It cannot be attached to a server as a network share; you access it through the REST API or an SDK.
Blobs do not offer a true hierarchical file structure beyond containers. Blob names can include virtual folder prefixes, but a virtual folder cannot be deleted as a single object, which matters for your daily purge; you can, however, implement the purge in your own code (the sketch below shows one way).
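A minimal purge sketch, assuming (my assumption, not something from the question) that blobs are written under a date prefix such as yyyy-MM-dd/ and that the Az.Storage PowerShell module is available; the container name 'data' is a placeholder:

    # Delete yesterday's blobs by prefix. The yyyy-MM-dd/ naming convention and
    # the container name 'data' are illustrative assumptions, not requirements.
    $ctx    = New-AzStorageContext -StorageAccountName $accountName -StorageAccountKey $accountKey
    $prefix = (Get-Date).AddDays(-1).ToString('yyyy-MM-dd') + '/'
    Get-AzStorageBlob -Container 'data' -Prefix $prefix -Context $ctx | Remove-AzStorageBlob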
Azure Files, by contrast, can be mounted as a regular SMB share (closer to your current NFS setup), but its scale targets for lots of small files are lower than Blob storage's.
Recommended links:
Comparison between Azure Files and Blobs: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
Informative SO post here
So the scenario is the following:
I have multiple instances of a web service that write a blob of data to Azure Storage. I need to be able to group blobs into a container (or a virtual directory) depending on when they were received. Once in a while (every day at worst) older blobs will get processed and then deleted.
I have two options:
Option 1
I make one container called "blobs" (for example) and store all the blobs in that container. Each blob uses a directory-style name, with the directory name being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ..., "hr23min0/dataN.bin", etc.; a new directory every X minutes). The thing that processes these blobs will process the hr0min0 blobs first, then hr0minX, and so on (and blobs are still being written while others are being processed).
Option 2
I have many containers, each with a name based on the arrival time (so the first will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and all the blobs in a container are those that arrived at the named time. The thing that processes these blobs will process one container at a time.
So my question is, which option is better? Does option 2 give me better parallelization (since containers can be on different servers), or is option 1 better because many containers might cause other, unknown issues?
Everyone has given you excellent answers around accessing blobs directly. However, if you need to list blobs in a container, you will likely see better performance with the many-container model. I just talked with a company that has been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs, and they're seeing a performance hit, as the time to retrieve a full listing keeps growing.
This might not apply to your scenario, but it's something to consider...
I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure blob storage is done at the blob level, not the container level. Reasons to spread out across different containers have more to do with access control (e.g. SAS) or total storage size.
See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
(Scroll down to "Partitions").
Quoting:
Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need them to (within the storage account space limit). The tradeoff is that we don’t provide the ability to do atomic transactions across multiple blobs.
Theoretically speaking, there should be no difference between lots of containers or fewer containers with more blobs. The extra containers can be nice as additional security boundaries (for public anonymous access or different SAS signatures for instance). Extra containers can also make housekeeping a bit easier when pruning (deleting a single container versus targeting each blob). I tend to use more containers for these reasons (not for performance).
Theoretically, the performance impact should not exist. The blob itself (the full URL) is the partition key in Windows Azure (and has been for a long time). That is the smallest unit that gets load-balanced across partition servers. So you could (and often will) have two different blobs in the same container being served by different servers.
Jeremy indicates there is a performance difference between more and fewer containers. I have not dug into those benchmarks enough to explain why that might be the case, but I would suspect other factors (like size, duration of test, etc.) to explain any discrepancies.
There is also one more factor that comes into this: price!
Currently the List and Create Container operations have the same price:
US$0.054 per 10,000 calls
Writing a blob is actually the same price.
So in an extreme case you can pay a lot more if you create and delete many containers (the delete itself is free).
You can see the calculator here:
https://azure.microsoft.com/en-us/pricing/calculator/
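To put rough numbers on it at that price: even one new container every 30 minutes comes to roughly 1,500 Create Container calls a month, far below a single 10,000-call billing unit, while writing a few million blobs in the same period dominates the bill either way. So the container-creation cost only matters if containers are created far more aggressively than that.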
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning
Understanding how Azure Storage partitions your blob data is useful for enhancing performance. Azure Storage can serve data in a single partition more quickly than data that spans multiple partitions. By naming your blobs appropriately, you can improve the efficiency of read requests.
Blob storage uses a range-based partitioning scheme for scaling and load balancing. Each blob has a partition key comprised of the full blob name (account+container+blob). The partition key is used to partition blob data into ranges. The ranges are then load-balanced across Blob storage.
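To make the single-container option from the question concrete: since the partition key is the full blob name, listing and purging one time bucket by its virtual-directory prefix works fine within one container. A hedged sketch (Az.Storage module; Invoke-ProcessBlob is a hypothetical placeholder for the real processing step):

    # Process, then delete, every blob in the hr0min0 "virtual directory" of the
    # single 'blobs' container (option 1). Invoke-ProcessBlob is a hypothetical
    # placeholder for the actual processing logic.
    $ctx = New-AzStorageContext -StorageAccountName $accountName -StorageAccountKey $accountKey
    Get-AzStorageBlob -Container 'blobs' -Prefix 'hr0min0/' -Context $ctx | ForEach-Object {
        Invoke-ProcessBlob $_        # hypothetical processing step
        $_ | Remove-AzStorageBlob    # purge once processed
    }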
I have an existing Azure CloudDrive that I want to make bigger. The simplest way I can think of is to create a new drive and copy everything over. I cannot see any way to just increase the size of the VHD. Is there a way?
Since an Azure drive is essentially a page blob, you can resize it. You'll find this blog post by the Windows Azure Storage team useful: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/04/11/using-windows-azure-page-blobs-and-how-to-efficiently-upload-and-download-page-blobs.aspx. Please read the section titled "Advanced Functionality – Clearing Pages and Changing Page Blob Size" for sample code.
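For what it's worth, a hedged sketch of the resize step with today's Az.Storage module (my adaptation, not the sample code from the linked post): ICloudBlob on a page blob is a CloudPageBlob, whose Resize() method changes the blob's length in place. The container, blob name, and target size are placeholders, and the VHD footer/partition table still have to be dealt with separately (see the methods listed further down):

    # Grow the page blob that backs the VHD. Resize() changes the blob's length
    # without touching existing pages; the new size must be a multiple of 512 bytes.
    # Note: this does not update the VHD footer or the partition table.
    $ctx  = New-AzStorageContext -StorageAccountName $accountName -StorageAccountKey $accountKey
    $blob = Get-AzStorageBlob -Container 'vhds' -Blob 'drive.vhd' -Context $ctx
    $blob.ICloudBlob.Resize(128GB)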
Yes, you can. I know this program; it is very easy to use. You can connect to your VHD, create a new one, upload a VHD and connect it with Azure, and upload or download files into the VHD: http://azuredriveexplorer.codeplex.com/
I have found these methods so far:
“The soft way”: increase the size of the page blob and fix the VHD data structure (the last 512 bytes). Theoretically this creates unpartitioned disk space after the current partition. But if the partition table also expects metadata at the end of the disk (GPT? or dynamic disks), that should be fixed as well. I'm aware of only one tool that can do this in-place modification. Unfortunately this tool is not much more than a one-weekend hack (at the time of this writing) and thus it is fragile. (See the disclaimer of the author.) But fast. Please notify me (or edit this post) if this tool gets improved significantly.
Create a larger disk and copy everything over, as you've suggested. This may be enough if you don't need to preserve NTFS features like junctions, soft/hard links etc.
Plan for the potential expansion and start with a huge (say 1 TB) dynamic VHD, comprised of a small partition and lots of unpartitioned (reserved) space. Windows Disk Manager will see the unpartitioned space in the VHD and can expand the partition into it whenever you want; an in-place operation. The subtle point is that the unpartitioned area, as long as it stays unpartitioned, won't be billed, because it isn't written to. (Note that either formatting or defragmenting does allocate the area and causes billing.) However, it will count against the quota of your Azure subscription (100 TB).
“The hard way”: download the VHD file, use a VHD-resizer program to insert unpartitioned disk space, mount the VHD locally, extend the partition to the unpartitioned space, unmount, and upload. This preserves everything, and even works for an OS partition, but is very slow due to the download/upload and software installations involved.
The same as above, but performed on a secondary VM in Azure. This speeds up downloading/uploading a lot. Step-by-step instructions are available here.
Unfortunately, all these techniques require unmounting the drive for quite a long time, i.e. they cannot be performed in a highly available manner.