A little bit of background first. Hopefully this can save someone else some trouble and frustration. Skip down to the TL;DR for the actual question.
We currently have a couple of genetics workflows related to gene sequencing running in Azure Batch. Some of them are quite light, and I'd like to move them to an Azure Function running a Docker container. For this purpose I have created a Docker image, based on the azure-functions image, containing Anaconda with the packages needed to run our most common, lighter workflows. My initial attempt produced a huge image of ~8GB. Moving to Miniconda and making a couple of other adjustments reduced the image size to just shy of 3.5GB. Still quite large, but it should be manageable.
For this I created an Azure Function running on an App Service Plan on the P1V2 tier, in the belief that I would have 250GB of storage to work with, as stated in the tier description.
After a couple of fixes I ran into issues loading my first image (the large one): the log indicated that there was no space left on the device. This puzzled me, since the quota showed that I'd used some 1.5MB of the 250GB total. At this point I reduced the image size and could at least deploy the image successfully again. After enabling SSH support, I logged in to the container via SSH and ran df -h.
Okay, so the function does not have the advertised 250GB of storage available at runtime. It only has about 34GB. I spent some time searching the documentation but could not find anything indicating that this should be the case. I did find this related SO question which clarified things a bit. I also found this still open issue on the azure functions github repo. It seems that more people are having the same issue and are not aware of the local storage limitation of the SKU. I might have overlooked something, so if this is in fact documented I'd be happy if someone could direct me there.
Now, the reason I need some storage is that I need to fetch the raw data file, which can be anything from a handful of MBs to several GBs. The workflow then produces multiple files varying between a few bytes and several GBs. The intention, however, was not to keep these on the function instance but to complete the workflow and then store the resulting files in blob storage.
TL;DR
You do not get the advertised storage capacity for functions running on an App Service Plan on the local instance. You get around 20/60/80GB depending on the SKU.
I need 10-30GB of local storage temporarily until the workflow has finished and the resulting files can be stored elsewhere.
How can I reduce the spent storage on the local instance?
Finally, the actual question. You might have noticed in the output of the df -h command that, of the available 34GB, a whopping 25GB is already used, which leaves 7.6GB to work with. I already mentioned that my image is ~3.5GB in size. So how come a total of 25GB is used, and is there any chance at all of reducing this aside from shrinking my image? That being said, even if I removed my image completely (freeing 3.5GB of storage) it would still not be quite enough. Maybe the function simply needs over 20GB worth of stuff to run?
Note: It is not a result of cached docker layers or the like since I have tried scaling the app service plan which clears the cached layers/images and re-downloads the image.
Moving up a tier gives me 60GB of total available storage on the instance. Which is enough. But it feels very overkill when I don't need the rest that this tier offers.
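The figures from df -h can also be checked from inside the workflow before it starts, so a run that cannot fit fails early instead of halfway through. A minimal sketch using only the Python standard library; the function names are made up for illustration, and the path to check depends on where the workflow actually writes.

```python
import shutil


def free_space_report(path="/"):
    """Return total, used and free space (in GiB) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }


def has_room_for(path, required_bytes):
    """True if the filesystem holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    print(free_space_report("/"))
    # Example: refuse to start a workflow that is expected to need 10 GiB.
    print(has_room_for("/", 10 * 1024 ** 3))
```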
Attempted solution 1
One thing I have tried, which might be of help to others, is mounting a file share on the function instance. This can be done with very little effort, as shown by the MS docs. Great, now I could write directly to a file share, saving me some headache, and finally move on. Or so I thought. While this mostly worked great, at some point it still threw an exception indicating that it had run out of space on the device, leading me to believe that it may be using local storage as temporary storage, a buffer, or whatever. I will continue looking into it and see if I can figure that part out.
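One possible explanation (an assumption, not something confirmed here) is that a tool in the workflow writes intermediate files to the default temporary directory, which lives on the local disk rather than on the mounted share. A minimal sketch of pointing temporary files at a directory on the share instead; the mount path in the usage comment is hypothetical, and whether a given command-line tool honours TMPDIR has to be checked per tool.

```python
import os
import tempfile


def use_scratch_dir(scratch_root):
    """Send Python temp files, and TMPDIR-aware child processes, to scratch_root."""
    os.makedirs(scratch_root, exist_ok=True)
    # Child processes inherit the environment; many tools read TMPDIR,
    # but that is an assumption to verify for each tool in the workflow.
    os.environ["TMPDIR"] = scratch_root
    # tempfile caches its directory choice, so override it explicitly.
    tempfile.tempdir = scratch_root
    return tempfile.gettempdir()


# Usage (hypothetical mount path):
# use_scratch_dir("/mounted/share/tmp")
```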
Any suggestions or alternative solutions will be greatly appreciated. I might just decide to move away from Azure Functions for this specific workflow. But I'd still like to clear things up for future reference.
Thanks in advance
niknoe
azure experts
I am a DS who recently took over managing the pipelines after our DE left, so I am quite new to Azure. I recently ran into an issue where our Azure storage usage increased quite significantly. I found some clues, but I am now stuck and would be really grateful if an expert could help me through.
Looking into the usage, the increase seems to be mainly due to a significant rise in transactions and ingress for the API ListFilesystemDir.
I traced the usage profile back to the time when our pipelines run and located the main usage in a copy data activity, which copies data from a folder in ADLS to a SQL DW staging table. The amount of data actually being transferred is only tens of MB, similar to before, so that should not be the reason for the substantial increase. The main difference I found is in the transactions and ingress for the API ListFilesystemDir, which shows an ingress volume of hundreds of MB and ~500k transactions during the pipeline run. The surprising thing is that the usage of this API was very small before (tens of kB and ~1k transactions).
[Screenshot: recent storage usage of API ListFilesystemDir]
Looking at the pipeline, I notice it uses a file path wildcard to filter and select all the most recently updated files within a higher-level folder.
[Screenshot: pipeline]
My questions are
Has there been any change to the usage, limits, or definition of the API ListFilesystemDir that could cause such an issue?
Does applying a wildcard path filter mean Azure needs to list all the files under the root folder and then filter them, leading to a huge number of transactions? It didn't seem to cause the issue before, so has there been any change to how this wildcard filter uses the API?
Any suggestions on solving the problem?
Thank you very much!
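To make the wildcard question concrete, here is a toy model of why filtering under a higher-level folder can cost one list call per subfolder, while pointing at a specific subfolder costs one. It is only an illustration with an in-memory tree and made-up names (CountingStore, find_matches, the landing/day-NNN layout); it does not claim to reproduce how the copy activity is actually implemented.

```python
import fnmatch


class CountingStore:
    """In-memory stand-in for a hierarchical store that counts list calls."""

    def __init__(self, tree):
        self.tree = tree  # nested dicts are folders, None marks a file
        self.list_calls = 0

    def list_dir(self, path):
        self.list_calls += 1
        node = self.tree
        for part in [p for p in path.split("/") if p]:
            node = node[part]
        return [(name, child is not None) for name, child in node.items()]


def find_matches(store, folder, pattern):
    """Recursively list folder and return file paths whose names match pattern."""
    matches = []
    for name, is_dir in store.list_dir(folder):
        full = f"{folder}/{name}" if folder else name
        if is_dir:
            matches.extend(find_matches(store, full, pattern))
        elif fnmatch.fnmatch(name, pattern):
            matches.append(full)
    return matches


def build_tree(days):
    """Hypothetical layout: one subfolder per day, two files in each."""
    return {"landing": {f"day-{i:03d}": {"a.csv": None, "b.txt": None}
                        for i in range(days)}}


if __name__ == "__main__":
    wide = CountingStore(build_tree(365))
    find_matches(wide, "landing", "*.csv")
    narrow = CountingStore(build_tree(365))
    find_matches(narrow, "landing/day-364", "*.csv")
    print(wide.list_calls, narrow.list_calls)
```

In this model the number of list calls grows with the number of folders under the starting point, not with the amount of data copied, which would be consistent with small transfers and many transactions.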
Situation:
A user with a TB worth of files on our Azure blob storage and gigabytes of storage in our Azure databases decides to leave our services. At this point, we need to export all his data into 2GB packages and deposit them on the blob storage for a short period (two weeks or so).
This should happen very rarely, and we're trying to cut costs. Where would it be optimal to implement a task that over the course of a day or two downloads the corresponding user's blobs (240 KB files) and zips them into the packages?
I've looked at a separate webapp running a dedicated continuous webjob, but webjobs seem to shut down when the app unloads, and I need this to hibernate and not use resources when not up and running, so "Always on" is out. Plus, I can't seem to find a complete tutorial on how to implement the interface, so that I may cancel the running task and such.
Our last resort is abandoning the webapps (three of them) and running it all on a virtual machine, but this comes at a greater cost. Is there a method I've missed that could get the job done?
This sounds like a job for a serverless model on Azure Functions to me. You get the compute scale you need without paying for idle resources.
I don't believe that there are any time limits on running the function (unlike AWS Lambda), but even so you'll probably want to implement something to split the job up first so it can be processed in parallel (and to provide some resilience to failures). Queue these tasks up and trigger the function off the queue.
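Splitting the job could look roughly like this: group the user's blobs into packages that stay under the 2GB limit, then queue one task per package. A minimal sketch, assuming the blob names and sizes are already known; plan_packages is a made-up name, and the zip format's own overhead is not accounted for, so a real limit should leave some headroom.

```python
def plan_packages(blobs, max_bytes=2 * 1024 ** 3):
    """Group (name, size) pairs into consecutive packages of at most max_bytes."""
    packages = []
    current, current_size = [], 0
    for name, size in blobs:
        if size > max_bytes:
            raise ValueError(f"{name} alone exceeds the package limit")
        # Start a new package when adding this blob would overflow the current one.
        if current and current_size + size > max_bytes:
            packages.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        packages.append(current)
    return packages


if __name__ == "__main__":
    # Ten 240 KB blobs with a deliberately small 1 MiB limit for illustration.
    blobs = [(f"blob-{i}", 240 * 1024) for i in range(10)]
    for index, package in enumerate(plan_packages(blobs, max_bytes=1024 * 1024)):
        print(index, package)
```

Each resulting list is small enough to fit in a single queue message, or can be stored as a manifest that the message points to.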
It's worth noting that they're still in 'preview' at the moment though.
Edit - have just noticed your comment on file size... that might be a problem, but in theory you should be able to use local storage rather than doing it all in memory.
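Using local storage rather than memory could be sketched like this: the zip file is written to disk and each blob is copied into it in chunks, so only one chunk is held in memory at a time. The open_stream callables are placeholders for whatever returns a readable stream for a blob; no particular storage SDK is assumed.

```python
import shutil
import zipfile


def build_package(zip_path, blobs, chunk_size=1024 * 1024):
    """Write (name, open_stream) pairs into a zip file on disk, chunk by chunk."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as archive:
        for name, open_stream in blobs:
            # open_stream() is a placeholder returning a readable binary stream.
            with open_stream() as source, archive.open(name, "w") as target:
                shutil.copyfileobj(source, target, length=chunk_size)
    return zip_path
```

Once a package is complete it can be uploaded to blob storage and deleted locally before the next one is started, which keeps local disk usage to roughly one package at a time.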
Whilst accepting that Backups in Windows Azure Websites are a preview feature, I can't seem to get them working at all. My site is approximately 3GB and on the standard tier. The settings are configured to move to a Geo-Redundant storage account with no other containers. There is no database selected, I'm only backing up the files.
In the Admin Portal, if I use the manual Backup Now button, a 0 bytes file is created within the designated storage account, dated 01/01/0001 00:00:00. However even after several days, it is not replaced with the 'actual' file.
If I use the automated backup scheduler, nothing happens at all - no errors, no 0 byte files.
Can anyone shed any light on this please?
The backup/restore feature is still in a preview mode and officially supports only 2 GB of data. From the error message you posted ("backup is currenly in progress") it seems you probably hit a bug which was there and was fixed last week (the result of that bug was that there were some lingering backups which blocked subsequent backups).
Please try it again, you should be able to invoke it now. If you find another error message in operational logs, feel free to post it here (just leave the RequestId in it unscrambled - we can correlate using that) and we can take a look.
However, as I mentioned in the beginning, more than 2 GBs are not fully supported yet (you might not be able to do e.g. roundtrip with your data - backup and then restore).
Thanks,
Petr
The max size for a Windows Azure Drive is 1TB.
M$ only charges for the data in it, not for its size.
My question is: why not just create an Azure Drive with a size of 1TB, so there are no more worries about resizing etc.?
Or is there a catch if I create a drive bigger than I need?
I often do that when creating an Azure Drive: allocating a maximum size drive of 1TB. No discernable penalty. The only advantage to setting a smaller size: protecting yourself against cost overruns. There might be a possibility it takes longer to initialize a 1TB drive, but I haven't measured it.
I have not yet found a lot of use for Azure Drives given some of the limitations that they have and the other storage options that are available, so I have only done some playing with them, not actually used one in a production environment.
With that said, based on my understanding, and the description you give in your question about only being charged for the amount of content stored on the drive I do not see any issue with creating a large drive initially and growing into it in the future.
Hope that helps some, even if it is just a - yes I think you understand it correctly!
The reason is pretty simple if you tried it. Namely, while you are not charged except for the data inside the drive, it does count against your quota limit. So, if every drive was 1TB, then you could create only 99 drives (think overhead here) before your storage account quota was gone. Also, yes, it does take longer to create a 1TB drive versus a smaller one (in practice).
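The quota arithmetic can be written out as a small sketch. The 100TB quota is the figure quoted elsewhere in this thread, and the 1GB reserved for other data is purely an illustrative assumption standing in for the "overhead" mentioned above.

```python
def max_drives(quota_gb, drive_gb, reserved_gb=0):
    """How many drives of drive_gb fit in quota_gb after reserving reserved_gb."""
    return (quota_gb - reserved_gb) // drive_gb


if __name__ == "__main__":
    quota_gb = 100 * 1024  # 100TB quota, as quoted in this thread
    print(max_drives(quota_gb, 1024, reserved_gb=1))  # 1TB drives
    print(max_drives(quota_gb, 50, reserved_gb=1))    # 50GB drives
```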
I have an existing Azure CloudDrive that I want to make bigger. The simplest way I can think of is to create a new drive and copy everything over. I cannot see any way to just increase the size of the VHD. Is there a way?
Since an Azure drive is essentially a page blob, you can resize it. You'll find this blog post by Windows Azure Storage team useful regarding that: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/04/11/using-windows-azure-page-blobs-and-how-to-efficiently-upload-and-download-page-blobs.aspx. Please read the section titled "Advanced Functionality – Clearing Pages and Changing Page Blob Size" for sample code.
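One detail to keep in mind when picking the new size: page blobs are made up of 512-byte pages, so the size has to be a multiple of 512. A minimal sketch of rounding a requested size up to a valid one; the function name is made up, and the actual resize call is left to the sample code in the linked post.

```python
PAGE_SIZE = 512  # page blobs are addressed in 512-byte pages


def aligned_blob_size(requested_bytes):
    """Round requested_bytes up to the next multiple of the 512-byte page size."""
    if requested_bytes <= 0:
        raise ValueError("size must be positive")
    return -(-requested_bytes // PAGE_SIZE) * PAGE_SIZE


if __name__ == "__main__":
    print(aligned_blob_size(50 * 1024 ** 3 + 100))
```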
Yes, you can.
I know of a program for this that is very easy to use: you can connect to your VHD or create a new one, upload a VHD, connect to Azure, and upload or download files to and from the VHD: http://azuredriveexplorer.codeplex.com/
I have found these methods so far:
“The soft way”: increase the size of the page blob and fix the VHD data structure (the last 512 bytes). Theoretically this creates unpartitioned disk space after the current partition. But if the partition table also expects metadata at the end of the disk (GPT? or Dynamic disks), that should be fixed as well.
I'm aware of only one tool that can do this in-place modification. Unfortunately this tool is not much more than a one-weekend hack (at the time of this writing) and thus it is fragile. (See the disclaimer of the author.) But it is fast. Please notify me (or edit this post) if this tool gets improved significantly.
Create a larger disk and copy everything over, as you've suggested. This may be enough if you don't need to preserve NTFS features like junctions, soft/hard links etc.
Plan for the potential expansion and start with a huge (say 1TB) dynamic VHD, comprised of a small partition and lots of unpartitioned (reserved) space. Windows Disk Manager will see the unpartitioned space in the VHD and can expand the partition into it whenever you want, as an in-place operation. The subtle point is that the unpartitioned area, as long as it stays unpartitioned, won't be billed, because it isn't written to. (Note that either formatting or defragmenting does allocate the area and causes billing.) However, it'll count against the quota of your Azure Subscription (100TB).
“The hard way”: download the VHD file, use a VHD-resizer program to insert unpartitioned disk space, mount the VHD locally, extend the partition into the unpartitioned space, unmount, upload. This preserves everything and even works for an OS partition, but is very slow due to the download/upload and software installations involved.
Same as above, but performed on a secondary VM in Azure. This speeds up downloading/uploading a lot. Step-by-step instructions are available here.
Unfortunately all these techniques require unmounting the drive for quite a long time, i.e. they cannot be performed in a highly available manner.
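For the “soft way”, it helps to be able to inspect the last 512 bytes before and after touching them. The sketch below only reads a footer and verifies its checksum; it does not modify anything. The field offsets (cookie at 0, current size at 48, checksum at 64) and the checksum rule reflect my reading of the published VHD format and should be checked against the specification. A real resize would also have to update other fields, such as the disk geometry.

```python
import struct

FOOTER_SIZE = 512
CHECKSUM_OFFSET = 64  # assumed offset of the 4-byte checksum field


def footer_checksum(footer):
    """One's complement of the footer's byte sum, with the checksum field zeroed."""
    if len(footer) != FOOTER_SIZE:
        raise ValueError("a VHD footer is expected to be 512 bytes")
    zeroed = (footer[:CHECKSUM_OFFSET] + b"\x00" * 4
              + footer[CHECKSUM_OFFSET + 4:])
    return ~sum(zeroed) & 0xFFFFFFFF


def read_footer(footer):
    """Extract the cookie and current size, and verify the stored checksum."""
    (current_size,) = struct.unpack(">Q", footer[48:56])
    (stored,) = struct.unpack(">I", footer[64:68])
    return {
        "cookie": footer[:8],
        "current_size": current_size,
        "checksum_ok": stored == footer_checksum(footer),
    }
```

Reading the last 512 bytes of the page blob and passing them to read_footer shows the size the VHD currently claims, which should match the blob length minus the footer for a fixed-size disk.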