The storage consumption on our ADLS Gen2 account rose from 5 TB to 314 TB within 10 days and has remained steady since then. The account has just two containers: the $logs container and a container holding all the directories for data storage. The $logs container looks empty. I ran Folder Statistics in Azure Storage Explorer on the other container, and none of the directories seems big enough to explain the increase.
Interestingly, Folder Statistics had been running on one of the directories for a few hours, so I cancelled it. On cancellation, the partial result showed 200+ TB and 88k+ blobs in it. A visual inspection of the directory showed just a handful of blobs that would barely add up to 1 GB. This directory had been present for months without issue. Regardless, I deleted the directory and checked the storage consumption after a few hours, but could not see any change.
This brings me to three questions:
If I cancel an ongoing Folder Statistics run, could it show an incorrect partial result (in the above case it showed 200 TB, whereas the directory looked like barely 1 GB in reality)? I have cancelled runs on previous occasions, but the partial stats always seemed plausible.
Could there be hidden blobs in ADLS Gen2 that do not show up on visual inspection? (I have Read, Write, Delete access, if that matters.)
I have run Folder Statistics in Azure Storage Explorer on all folders individually. Is there a better way to get the storage consumption in one go (at least broken down by directory and sub-directory; blob level would probably be overkill, but whatever works)? I have access to Databricks with a mount point to this container and can create a cluster with the required runtime if such code is specific to one.
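One possible way to get a per-directory breakdown in one go is to walk the mounted container and sum file sizes per directory prefix. The following is a minimal sketch, not a tested Databricks recipe: the mount path `/dbfs/mnt/datalake` is a hypothetical placeholder, and the helper names `directory_sizes` and `walk_files` are made up for illustration. It only counts what a normal file listing returns, so anything that a listing does not show will not appear in the totals either.

```python
import os
from collections import defaultdict


def directory_sizes(files, depth=2):
    """Sum file sizes per directory prefix, down to `depth` path levels.

    `files` is an iterable of (relative_path, size_in_bytes) pairs.
    """
    totals = defaultdict(int)
    for path, size in files:
        parts = path.strip("/").split("/")[:-1]  # drop the file name
        # Credit the file to each ancestor directory, up to `depth` levels.
        for level in range(1, min(depth, len(parts)) + 1):
            totals["/".join(parts[:level])] += size
    return dict(totals)


def walk_files(root):
    """Yield (relative_path, size) for every file under a local or mounted root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            yield rel, os.path.getsize(full)


if __name__ == "__main__":
    root = "/dbfs/mnt/datalake"  # hypothetical mount point, replace with yours
    if os.path.isdir(root):
        sizes = directory_sizes(walk_files(root), depth=2)
        # Print the 20 largest directories first.
        for directory, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:20]:
            print(f"{size / 1024**4:10.3f} TiB  {directory}")
```

Keeping the aggregation separate from the listing means the same `directory_sizes` function can be fed from any other source of (path, size) pairs, for example an inventory export.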
Update: We found the cause of the increase. It was, in fact, a few copy activities created by our team.
Interestingly, when we deleted the copied data, it took about 48 hrs before the storage graph actually started going down, although the files disappeared immediately.
This was not a delay in refreshing the consumption graph; it actually took that long before we saw the expected sharp dip in storage.
We raised a Microsoft case and they confirmed that such an amount of data can take time to actually delete in the background.
I need to load data from different files into an Azure SQL database. So I set up a VM running Airflow and two Azure File Shares, one for my dags (so that I can modify them without sshing into the VM) and another to drop the files that will be loaded.
I mounted those two fileshares to the VM and my PC and use them as normal drives.
The system is currently idling and I can see in Azure's portal that I'm getting about 24k transactions every 5 minutes, but I can't see specifically what is generating them.
Is it possible the VM is constantly requesting a list of files or touching the fileshare to check if it's still there? How can I avoid this?
Thanks!
I can confirm that having the dags folder in a shared drive was the cause of the insane amount of transactions. I moved the dags folder to the VM drive and now everything is back to normal.
I was running into a similar issue, having 8k transactions every 5 minutes for just 3 DAGs. I got it down to about 800 transactions every 5 minutes by setting file_parsing_sort_mode to alphabetical.
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#file-parsing-sort-mode
The default setting, modified_time, makes the DAG processor retrieve the last-modified time of the file from the file share on every loop. Weirdly, this action even triggers write operations, which are more costly than read operations.
https://github.com/apache/airflow/blob/2d79d730d7ff9d2c10a2e99a4e728eb831194a97/airflow/dag_processing/manager.py#L982-L1008
Same answer posted on a similar question here: https://stackoverflow.com/a/70524563/6654620
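To illustrate why the sort mode changes the transaction count, here is a simplified model of the two orderings. It is not Airflow's actual code (see the linked manager.py for that); `order_dag_files` and `CountingStat` are hypothetical names used only for this sketch. Sorting by modification time needs one metadata lookup per file on every loop, while alphabetical sorting needs none. The sketch does not explain the write operations mentioned above.

```python
import os
import tempfile


def order_dag_files(paths, sort_mode, stat=os.path.getmtime):
    """Return paths in parse order; `stat` is only called for modified_time."""
    if sort_mode == "modified_time":
        # One metadata lookup per file, every time the list is re-sorted.
        return sorted(paths, key=stat, reverse=True)
    if sort_mode == "alphabetical":
        # Pure string comparison: no file system access at all.
        return sorted(paths)
    raise ValueError(f"unsupported sort mode: {sort_mode}")


class CountingStat:
    """Wraps os.path.getmtime and counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, path):
        self.calls += 1
        return os.path.getmtime(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for name in ("b.py", "a.py", "c.py"):
            path = os.path.join(folder, name)
            open(path, "w").close()
            paths.append(path)
        for mode in ("alphabetical", "modified_time"):
            counter = CountingStat()
            order_dag_files(paths, mode, stat=counter)
            print(mode, "metadata lookups per loop:", counter.calls)
```

On a local disk these lookups are nearly free; on a mounted file share each one can become a billed transaction, which is consistent with the drop reported above.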
A little bit of background first. Maybe, hopefully, this can save someone else some trouble and frustration. Move down to the TL;DR to move on to the actual question.
We currently have a couple of genetics workflows related to gene sequencing running in Azure Batch. Some of them are quite light, and I'd like to move those to an Azure Function running a Docker container. For this purpose I have created a Docker image, based on the azure-functions image, containing Anaconda with the necessary packages to run our most common, lighter workflows. My initial attempt produced a huge image of ~8GB. Moving to Miniconda and a couple of other adjustments reduced the image size to just shy of 3.5GB. Still quite large, but it should be manageable.
For this function I created an Azure Function running on an App Service Plan on the P1V2 tier on the belief that I would have 250GB storage to work with as stated in the tier description:
I encountered some issues with loading my first image (the large one): after a couple of fixes, the log indicated that there was no space left on the device. This puzzled me, since the quota stated that I'd used some 1.5MB of the 250GB total. At this point I reduced the image size and could at least successfully deploy the image again. After enabling SSH support, I logged in to the container via SSH and ran df -h.
Okay, so the function does not have the advertised 250GB of storage available at runtime. It only has about 34GB. I spent some time searching the documentation but could not find anything indicating that this should be the case. I did find this related SO question, which clarified things a bit. I also found this still open issue on the Azure Functions GitHub repo. It seems that more people are having the same issue and are not aware of the local storage limitation of the SKU. I might have overlooked something, so if this is in fact documented I'd be happy if someone could direct me there.
Now the reason I need some storage is that I need to get the raw data file which can be anything from a handful of MBs to several GBs. And the workflow then subsequently produces multiple files varying between a few bytes and several GBs. The intention was, however, not to store this on the function instances but to complete the workflow and then store the resulting files in a blob storage.
TL;DR
You do not get the advertised storage capacity for functions running on an App Service Plan on the local instance. You get around 20/60/80GB depending on the SKU.
I need 10-30GB of local storage temporarily until the workflow has finished and the resulting files can be stored elsewhere.
How can I reduce the spent storage on the local instance?
Finally, the actual question. You might have noticed in the screenshot of the df -h output that of the available 34GB a whopping 25GB is already used, which leaves 7.6GB to work with. I already mentioned that my image is ~3.5GB in size. So how come a total of 25GB is used, and is there any chance at all to reduce this aside from shrinking my image? That being said, even if I removed my image completely (freeing 3.5GB of storage) it would still not be quite enough. Maybe the function simply needs over 20GB worth of stuff to run?
Note: It is not a result of cached docker layers or the like since I have tried scaling the app service plan which clears the cached layers/images and re-downloads the image.
Moving up a tier gives me 60GB of total available storage on the instance. Which is enough. But it feels very overkill when I don't need the rest that this tier offers.
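As a small aid for this kind of situation, the df -h check can also be done from inside the workflow before it starts downloading data. This is only a sketch using Python's standard `shutil.disk_usage`; the helper name `has_free_space`, the path `/`, and the 30 GiB requirement are assumptions chosen to match the numbers in this post.

```python
import shutil

GIB = 1024 ** 3


def has_free_space(path, required_bytes, usage=shutil.disk_usage):
    """Return True if the file system holding `path` has enough free bytes."""
    return usage(path).free >= required_bytes


if __name__ == "__main__":
    total, used, free = shutil.disk_usage("/")
    print(f"total {total / GIB:.1f} GiB, used {used / GIB:.1f} GiB, free {free / GIB:.1f} GiB")
    needed = 30 * GIB  # upper end of the 10-30GB the workflow needs
    if not has_free_space("/", needed):
        print("Not enough local storage; fail fast instead of dying mid-run.")
```

Failing early with a clear message is easier to diagnose than a "no space left on device" error halfway through a multi-gigabyte run.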
Attempted solution 1
One thing I have tried, which might be of help to others, is mounting a file share on the function instance. This can be done with very little effort, as shown by the MS docs. Great, now I could write directly to a file share, saving me some headache, and finally move on. Or so I thought. While this mostly worked great, it still threw an exception at some point indicating that it had run out of space on the device, leading me to believe that it may be using local storage as temporary storage, a buffer, or whatever. I will continue looking into it and see if I can figure that part out.
Any suggestions or alternative solutions will be greatly appreciated. I might just decide to move away from Azure Functions for this specific workflow. But I'd still like to clear things up for future reference.
Thanks in advance
niknoe
I found quite a few answers on copying blobs between Azure storage accounts. I know of the Start-AzureStorageBlobCopy cmdlet. However, I have more than 20 million files to copy between two storage accounts in the same data center, and it seems to take forever (it has been copying for more than a week) since it starts each file copy process separately.
Furthermore, I found that in the most current version of the Azure tools (7.4), the cmdlet downloads the full file list (to memory) and only then starts with the copy process. So it not only takes forever but uses large amount of memory. The same is also true if I use AzCopy.
Thus my question: what is a good possibility (that actually really works!) to copy large amounts of files of which each is not that big between two storage accounts in the same data center? Or maybe you know of parameters to set when using the cmdlets (the documentation is awful and not updated)?
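The two problems described above (loading the full file list into memory, and starting copies one at a time) can be addressed in general terms by streaming the listing in batches and running the per-blob copy calls concurrently. The sketch below shows only that pattern: `copy_one` is a hypothetical callable that would wrap whatever per-blob copy call you use, and the batch size and worker count are arbitrary assumptions, not tuned values.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materialising the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def _attempt(copy_one, name):
    """Run one copy and return (name, error), where error is None on success."""
    try:
        copy_one(name)
        return name, None
    except Exception as exc:  # collect failures so they can be retried later
        return name, exc


def copy_all(blob_names, copy_one, batch_size=1000, workers=32):
    """Copy every blob name concurrently; return (copied_count, failed_names)."""
    copied, failed = 0, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(blob_names, batch_size):
            for name, error in pool.map(lambda n: _attempt(copy_one, n), batch):
                if error is None:
                    copied += 1
                else:
                    failed.append(name)
    return copied, failed
```

Because `blob_names` can be a generator, only one batch of names is held in memory at a time, and the returned list of failures can be fed back into `copy_all` for a retry pass.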
I'm trying to figure out the best performing approach when writing thousands of small Blobs to Azure Storage.
The application scenario is the following:
thousands of files are being created or overwritten by a constantly running Windows service installed on a Windows Azure VM
writing to the Temporary Storage available to the VM, the service can reach more than 9,000 file creations per second
file sizes range between 1 KB and 60 KB
on other VMs running the same software, other files are being created at the same rate and with the same criteria
given the need to build and keep updated a central repository, another service running on each VM copies the newly created files from the Temporary Storage to Azure Blobs
other servers should then read the Azure Blobs in their most recent version
Please note that, because of many constraints that I'm not listing for brevity, it's not currently possible to modify the main service to create Blobs directly instead of files on the Temporary file system. From what I'm currently seeing, it would also mean a slower rate of creation, which is not acceptable per the original requirements.
This copy operation, that I'm testing in a tight loop on 10,000 files, seems to be limited at 200 blob creations per second. I've been able to reach this result after tweaking the sample code named "Windows Azure ImportExportBlob" found here: http://code.msdn.microsoft.com/windowsazure/Windows-Azure-ImportExportB-9d30ddd5 with the async suggestions found in this answer: Using Parallel.Foreach in a small azure instance
I obtained this apparent maximum of 200 blob creations per second on an ExtraLarge VM with 8 cores, setting the "maxConcurrentThingsToProcess" semaphore accordingly. Network utilization during the test peaks at 1% of the 10 Gb link shown in Task Manager. That is roughly 100 Mb/s of the 800 Mb/s that should be available on that VM size.
I see that the total size copied during the elapsed time gives me around 10 MB/sec.
Is there some limitation on the Azure Storage traffic you can generate, or should I use a different approach when writing so many small files?
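As a quick sanity check, the figures quoted above can be related to each other with some back-of-the-envelope arithmetic. The sketch uses only the numbers from this post and assumes decimal units (1 MB = 1000 KB, 8 bits per byte); the variable names are made up for illustration.

```python
blobs_per_second = 200          # observed creation rate
observed_mb_per_s = 10          # observed throughput in MB/s
link_capacity_mbit = 10_000     # 10 Gb link shown in Task Manager
utilisation_percent = 1         # peak utilisation during the test

# Average blob size implied by the two observed rates.
implied_avg_blob_kb = observed_mb_per_s * 1000 / blobs_per_second

# Bandwidth actually used, in Mbit/s and MB/s.
used_mbit_per_s = link_capacity_mbit * utilisation_percent / 100
used_mb_per_s = used_mbit_per_s / 8

print(f"implied average blob size: {implied_avg_blob_kb} KB")
print(f"bandwidth used: {used_mbit_per_s} Mbit/s = {used_mb_per_s} MB/s")
```

The implied average of 50 KB per blob sits within the stated 1 KB to 60 KB range, and 100 Mbit/s (12.5 MB/s) is in line with the observed 10 MB/s, so the numbers are mutually consistent. They suggest the ceiling is on requests per second rather than on bandwidth.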
@breischl Thank you for the scalability targets. After reading that post, I started searching for more target figures possibly prepared by Microsoft and found 4 posts (too many for my "reputation" to be posted here; the other 3 are parts 2, 3 and 4 of the same series):
http://blogs.microsoft.co.il/blogs/applisec/archive/2012/01/04/windows-azure-benchmarks-part-1-blobs-read-throughput.aspx
the first post contains an important hint: "You may have to increase the ServicePointManager.DefaultConnectionLimit for multiple threads to establish more than 2 concurrent connections with the storage."
I set this to 300, reran the test and saw an important increase in MB/s. As I previously wrote, I suspected I was hitting a limit in the underlying blob service when "too many" threads are writing blobs. This confirmed my worries. Thus, I removed all the changes made to the code to work with a semaphore and replaced them again with a Parallel.For to start as many blob upload operations as possible. The result has been awesome: 61 MB/s writing blobs and 65 MB/s reading.
The scalability target is 60 MB/s and I'm finally happy with the result.
Thank you all again for your answers.
So the scenario is the following:
I have multiple instances of a web service that write blobs of data to Azure Storage. I need to be able to group blobs into a container (or a virtual directory) depending on when they were received. Once in a while (every day at the worst) older blobs will get processed and then deleted.
I have two options:
Option 1
I make one container called "blobs" (for example) and then store all the blobs in that container. Each blob will use a directory-style name, with the directory name being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ... , "hr23min0/dataN.bin", etc.; a new directory every X minutes). The thing that processes these blobs will process hr0min0 blobs first, then hr0minX and so on (and blobs are still being written while others are being processed).
Option 2
I have many containers, each with a name based on the arrival time (so first will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and all the blobs in a container are those that arrived at the named time. The thing that processes these blobs will process one container at a time.
So my question is, which option is better? Does option 2 give me better parallelization (since containers can be on different servers), or is option 1 better because many containers can cause other unknown issues?
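For concreteness, the two naming schemes can be sketched as small helpers that map an arrival time to a time bucket. The function names and the 30-minute default are hypothetical choices for illustration. One detail worth noting for option 2: Azure container names may only contain lowercase letters, digits and hyphens, so the sketch uses "blobs-hr0min0" rather than an underscore.

```python
from datetime import datetime


def bucket_prefix(received, bucket_minutes=30):
    """Map an arrival time to its bucket label, e.g. 00:47 -> 'hr0min30'."""
    minute = (received.minute // bucket_minutes) * bucket_minutes
    return f"hr{received.hour}min{minute}"


def blob_name(received, filename, bucket_minutes=30):
    """Option 1: one container, bucket encoded as a virtual directory."""
    return f"{bucket_prefix(received, bucket_minutes)}/{filename}"


def container_name(received, bucket_minutes=30):
    """Option 2: one container per bucket (hyphen, since '_' is not allowed)."""
    return f"blobs-{bucket_prefix(received, bucket_minutes)}"


if __name__ == "__main__":
    arrival = datetime(2024, 1, 1, 0, 47)
    print(blob_name(arrival, "data.bin"))
    print(container_name(arrival))
```

Either way, the processor can derive the bucket to work on from the clock alone, without listing anything first.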
Everyone has given you excellent answers around accessing blobs directly. However, if you need to list blobs in a container, you will likely see better performance with the many-container model. I just talked with a company who's been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs. They're seeing a performance hit, as the time to retrieve a full listing has been growing.
This might not apply to your scenario, but it's something to consider...
I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Win Azure blobs storage is done at the blob level, not the container. Reasons to spread out across different containers have more to do with access control (e.g. SAS) or total storage size.
See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
(Scroll down to "Partitions").
Quoting:
Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need them to (within the storage account space limit). The tradeoff is that we don’t provide the ability to do atomic transactions across multiple blobs.
Theoretically speaking, there should be no difference between lots of containers or fewer containers with more blobs. The extra containers can be nice as additional security boundaries (for public anonymous access or different SAS signatures for instance). Extra containers can also make housekeeping a bit easier when pruning (deleting a single container versus targeting each blob). I tend to use more containers for these reasons (not for performance).
Theoretically, the performance impact should not exist. The blob itself (full URL) is the partition key in Windows Azure (has been for a long time). That is the smallest thing that will be load-balanced from a partition server. So, you could (and often will) have two different blobs in same container being served out by different servers.
Jeremy indicates there is a performance difference between more and fewer containers. I have not dug into those benchmarks enough to explain why that might be the case, but I would suspect other factors (like size, duration of test, etc.) to explain any discrepancies.
There is also one more factor that comes into this: price!
Currently the List and Create Container operations cost the same:
US$0.054 per 10,000 calls
Writing a blob actually costs the same.
So in the extreme case you can pay a lot more if you create and delete many containers.
Delete is free.
you can see the calculator here:
https://azure.microsoft.com/en-us/pricing/calculator/
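To put the quoted price in perspective, here is a small worked calculation. It uses only the figure given in this answer (US$0.054 per 10,000 calls), which may well be outdated, so treat the constant as an assumption and check the calculator for current pricing. The function name is made up for illustration.

```python
PRICE_PER_10K_CALLS_USD = 0.054  # figure quoted in this answer, may be outdated


def operation_cost(calls, price_per_10k=PRICE_PER_10K_CALLS_USD):
    """Cost in US$ for a number of billed operations at a per-10,000 price."""
    return calls / 10_000 * price_per_10k


if __name__ == "__main__":
    containers_per_day = 24 * 60  # one new container every minute
    print(f"container creates per day: US${operation_cost(containers_per_day):.4f}")
    print(f"one million blob writes:   US${operation_cost(1_000_000):.2f}")
```

At these rates, creating a container per time bucket costs very little next to the blob writes themselves, unless the buckets are extremely fine-grained.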
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning
Understanding how Azure Storage partitions your blob data is useful for enhancing performance. Azure Storage can serve data in a single partition more quickly than data that spans multiple partitions. By naming your blobs appropriately, you can improve the efficiency of read requests.
Blob storage uses a range-based partitioning scheme for scaling and load balancing. Each blob has a partition key comprised of the full blob name (account+container+blob). The partition key is used to partition blob data into ranges. The ranges are then load-balanced across Blob storage.
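A toy model may help picture range-based partitioning. This is purely illustrative and is not how Azure Storage is implemented: the key separator, the range boundaries and the server names below are all hypothetical. It shows the idea that sorted key ranges are assigned to servers, so two blobs in the same container can land on different servers.

```python
from bisect import bisect_right


def partition_key(account, container, blob):
    """Build a key from account + container + blob (separator is an assumption)."""
    return f"{account}/{container}/{blob}"


def server_for(key, boundaries, servers):
    """Pick the server whose key range contains `key`.

    `boundaries` is a sorted list of split points; there must be exactly
    one more server than there are boundaries.
    """
    return servers[bisect_right(boundaries, key)]


if __name__ == "__main__":
    boundaries = ["acct/blobs/hr1", "acct/blobs/hr2"]  # hypothetical split points
    servers = ["ps-0", "ps-1", "ps-2"]                 # hypothetical servers
    for blob in ("hr0min0/data.bin", "hr1min45/data.bin", "hr23min0/dataN.bin"):
        key = partition_key("acct", "blobs", blob)
        print(key, "->", server_for(key, boundaries, servers))
```

In this model, names that share a long common prefix sort next to each other and tend to fall into the same range, which is why blob naming is relevant to how load spreads.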