Azure Blob Storage Python API performance

Downloading ~60MB with the Python API can take 10-15 minutes or even more.
Is there a way/example to improve the performance?

Thanks for your help!
I found that the container was on the cold (cool) blob storage access tier.
After upgrading it to the hot tier, the performance increased dramatically.

It could be related to a lot of factors, such as slow internet speed or a different geo-location.
If the client downloads a file through your Python API, you can definitely improve the overall performance by just passing a valet-key token (a SAS token) to the client.
Then let the client download straight from Azure Blob Storage.
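In case it's useful, here is a minimal sketch of that valet-key approach, assuming the azure-storage-blob v12 SDK; the account, container, and blob names below are placeholders:

```python
# Minimal sketch of the valet-key pattern: generate a short-lived, read-only SAS URL
# and hand it to the client. Assumes azure-storage-blob v12; names are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

account_name = "mystorageaccount"   # hypothetical account
account_key = "<account-key>"
container_name = "data"
blob_name = "large-file.bin"

# Read-only SAS token that expires after one hour.
sas_token = generate_blob_sas(
    account_name=account_name,
    container_name=container_name,
    blob_name=blob_name,
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

# The client downloads directly from Blob Storage using this URL,
# so the bytes never pass through your API process.
download_url = (
    f"https://{account_name}.blob.core.windows.net/"
    f"{container_name}/{blob_name}?{sas_token}"
)
print(download_url)
```

The client can then simply GET that URL (with requests, curl, or a browser), which removes your API as a bottleneck.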

According to your description, your actual network bandwidth appears to be less than 1 Mbps: roughly 544~816 Kbps (68.3~102.4 KB/s) if downloading ~60MB takes 10~15 minutes. If that bandwidth figure is accurate, I think your case is normal.
Per my experience, for downloading a large blob that consists of multiple blocks, one way to improve download performance is to first call Get Block List on the blob, then download the blocks concurrently via Get Blob (with range headers) using multithreading or multiprocessing in Python, writing to a file with the file.seek(offset) method, where the offset is the cumulative size of the blocks ordered by block ID. However, your blob is only ~60MB, which is less than 64MB, so it is normally a single block. I therefore don't think this approach suits your case; beyond that, the main option is to improve your network bandwidth.
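For reference, here is a rough sketch of that ranged, concurrent download idea, assuming the azure-storage-blob v12 SDK and a blob large enough to be worth splitting; the connection string, names, and chunk size are placeholders:

```python
# Rough sketch: split a blob into fixed-size ranges and download them in parallel,
# writing each range at its own offset. Assumes azure-storage-blob v12; placeholders used.
from concurrent.futures import ThreadPoolExecutor

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="data",
    blob_name="large-file.bin",
)

blob_size = blob.get_blob_properties().size
chunk_size = 4 * 1024 * 1024  # 4 MiB ranges (placeholder value)
ranges = [
    (offset, min(chunk_size, blob_size - offset))
    for offset in range(0, blob_size, chunk_size)
]

def fetch(offset_length):
    offset, length = offset_length
    data = blob.download_blob(offset=offset, length=length).readall()
    return offset, data

# Pre-size the local file, then write each downloaded range at its offset.
with open("large-file.bin", "wb") as f:
    f.truncate(blob_size)
    with ThreadPoolExecutor(max_workers=8) as pool:
        for offset, data in pool.map(fetch, ranges):
            f.seek(offset)
            f.write(data)
```

Note that newer versions of the SDK can do essentially the same thing for you via download_blob(max_concurrency=...), so rolling your own ranged download is rarely necessary today.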

Related

Azure Storage Server Latency experienced when reading the same blob multiple times

I am using an Azure Storage Account to deliver images to a client service, and we have been experiencing latency for a couple of days, with download times jumping from a few ms to a few seconds for the same blob being downloaded.
I read on MSDN here that it can happen when the same blob is queried for reading over and over, which is our case.
Has anyone experienced the same issue before? Can it randomly jump by 1000x?
And what would be the easiest way to fix this problem? It is compromising the performance of the client system, and since there is only one client machine, located in the same datacenter, it doesn't really make sense (from my point of view) to set up a CDN.
A potential mitigation is to set up a cache on the client side as well, but this is a bit complex due to the amount of data being queried, and I'd like to solve this issue on the storage side directly.
We use:
A Premium Storage Account
Locally redundant
And a BlockBlobStorage Account Kind
Thanks for your help.

Azcopy: copy millions of small blobs from a storage account to your VM

I am trying to copy millions of small CSV files from my storage account to a physical machine using the azcopy command, and I noticed that the speed has been very slow.
The azcopy command has the format
azcopy copy <storage_account_source> --recursive --overwrite=true
and the command is run from the physical machine.
Is there a way to make azcopy download multiple blobs concurrently, instead of checking the blobs one by one? I believe that is why the speed drops to such a low value of 1 MB/second: it is doing checks on these really small blobs one at a time. Or is there another way to increase the speed of this kind of blob transfer?
azcopy is highly optimized for throughput, using parallel processing and so on. I haven't come across any tool that provides a faster overall download speed. In my experience the main limiting factors are usually (obviously) network bandwidth, but also CPU power, since azcopy uses a lot of compute resources. So can you increase those two on your VM, at least for the duration of the download?
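If it helps, here is a hedged sketch of driving azcopy from Python while raising its request concurrency via the documented AZCOPY_CONCURRENCY_VALUE environment variable; the source URL, destination path, and concurrency value are placeholders:

```python
# Sketch: run azcopy with an explicit concurrency setting. AZCOPY_CONCURRENCY_VALUE
# controls how many concurrent requests azcopy makes (it otherwise picks a default
# based on CPU count). Source/destination below are placeholders.
import os
import subprocess

env = os.environ.copy()
env["AZCOPY_CONCURRENCY_VALUE"] = "256"  # placeholder; tune for your machine

subprocess.run(
    [
        "azcopy", "copy",
        "https://<account>.blob.core.windows.net/<container>?<sas>",
        "<local-destination-path>",
        "--recursive",
        "--overwrite=true",
    ],
    env=env,
    check=True,
)
```

Whether a higher concurrency value actually helps still depends on the bandwidth and CPU limits mentioned above.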

How to reduce used storage in azure function for container

A little bit of background first. Maybe, hopefully, this can save someone else some trouble and frustration. Skip down to the TL;DR to get to the actual question.
We currently have a couple of genetics workflows related to gene sequencing running in Azure Batch. Some of these are quite light, and I'd like to move them to an Azure Function running a Docker container. For this purpose I created a Docker image, based on the azure-functions image, containing Anaconda with the necessary packages to run our most common, lighter workflows. My initial attempt produced a huge image of ~8GB. Moving to Miniconda and a couple of other adjustments reduced the image size to just shy of 3.5GB. Still quite large, but it should be manageable.
For this function I created an Azure Function running on an App Service Plan on the P1V2 tier, on the belief that I would have 250GB of storage to work with, as stated in the tier description.
I encountered some issues with loading my first image (the large one) after a couple of fixes, where the log indicated that there was no space left on the device. This puzzled me, since the quota stated that I'd used some 1.5MB of the 250GB total. At this point I reduced the image size and could at least deploy the image successfully again. After enabling SSH support, I logged in to the container via SSH and ran df -h.
Okay, so the function does not have the advertised 250GB of storage available at runtime. It only has about 34GB. I spent some time searching the documentation but could not find anything indicating that this should be the case. I did find this related SO question, which clarified things a bit. I also found this still-open issue on the Azure Functions GitHub repo. It seems that more people are having the same issue and are not aware of the local storage limitation of the SKU. I might have overlooked something, so if this is in fact documented I'd be happy if someone could direct me there.
Now, the reason I need some storage is that I need to get the raw data file, which can be anything from a handful of MBs to several GBs. The workflow then produces multiple files varying between a few bytes and several GBs. The intention was, however, not to store these on the function instances but to complete the workflow and then store the resulting files in blob storage.
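For illustration, a minimal sketch of that intended flow, assuming the azure-storage-blob v12 SDK; the container name, blob names, and local paths are placeholders:

```python
# Sketch of the intended flow: stream the raw input to temporary local disk,
# run the workflow, then upload results and delete local copies so local storage
# stays small. Assumes azure-storage-blob v12; names and paths are placeholders.
import os

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>", container_name="genomics-data"
)

# Stream the raw data file down in chunks instead of holding it all in memory.
with open("/tmp/raw_input.dat", "wb") as f:
    stream = container.download_blob("raw/raw_input.dat")
    for chunk in stream.chunks():
        f.write(chunk)

# ... run the workflow against /tmp/raw_input.dat, producing result files ...

# Upload each result, then delete the local copy.
for path in ["/tmp/result_a.vcf", "/tmp/result_b.vcf"]:
    with open(path, "rb") as data:
        container.upload_blob(
            name=f"results/{os.path.basename(path)}", data=data, overwrite=True
        )
    os.remove(path)
```

Uploading and deleting results as they are produced keeps the peak local-disk footprint closer to the largest single file rather than the whole working set.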
TL;DR
You do not get the advertised storage capacity for functions running on an App Service Plan on the local instance. You get around 20/60/80GB depending on the SKU.
I need 10-30GB of local storage temporarily until the workflow has finished and the resulting files can be stored elsewhere.
How can I reduce the spent storage on the local instance?
Finally, the actual question. You might have noticed from the output of the df -h command that of the available 34GB, a whopping 25GB is already used, which leaves 7.6GB to work with. I already mentioned that my image is ~3.5GB in size. So how come a total of 25GB is used, and is there any chance at all to reduce this, aside from shrinking my image? That said, even if I removed my image completely (freeing 3.5GB of storage), it would still not be quite enough. Maybe the function simply needs over 20GB of storage to run?
Note: it is not a result of cached Docker layers or the like, since I have tried scaling the App Service Plan, which clears the cached layers/images and re-downloads the image.
Moving up a tier gives me 60GB of total available storage on the instance, which is enough, but it feels like overkill when I don't need the rest of what that tier offers.
Attempted solution 1
One thing I have tried, which might be of help to others, is mounting a file share on the function instance. This can be done with very little effort, as shown in the MS docs. Great, now I could write directly to a file share, saving me some headache, and finally move on. Or so I thought. While this mostly worked great, it still threw an exception at some point indicating that it ran out of space on the device, leading me to believe that it may be using local storage as temporary storage, a buffer, or something similar. I will continue looking into it and see if I can figure that part out.
Any suggestions or alternative solutions will be greatly appreciated. I might just decide to move away from Azure Functions for this specific workflow, but I'd still like to clear things up for future reference.
Thanks in advance
niknoe

Can I create a Windows Azure Drive at the full 1TB size straight away?

The max size for a Windows Azure Drive is 1TB.
Microsoft only charges for the data in it, not for the allocated size.
My question is: why not just create an Azure Drive at 1TB, so there are no more worries about resizing, etc.?
Or is there a catch if I create a drive bigger than I need?
I often do that when creating an Azure Drive: allocating a maximum-size drive of 1TB. There is no discernible penalty. The only advantage to setting a smaller size is protecting yourself against cost overruns. It might possibly take longer to initialize a 1TB drive, but I haven't measured it.
I have not yet found a lot of use for Azure Drives, given some of the limitations that they have and the other storage options that are available, so I have only done some playing with them, not actually used one in a production environment.
With that said, based on my understanding and the description you give in your question about only being charged for the amount of content stored on the drive, I do not see any issue with creating a large drive initially and growing into it in the future.
Hope that helps some, even if it is just a "yes, I think you understand it correctly!"
The reason is pretty simple if you have tried it: while you are not charged for anything except the data inside the drive, the drive's full size does count against your storage account quota. So if every drive were 1TB, you could create only 99 drives (think overhead here) before your storage account quota was gone. Also, yes, in practice it does take longer to create a 1TB drive than a smaller one.

With Azure, should I store my static shared JS and image files in the release package or in a blob?

With speed and cost in mind.
Say I have a few JS and image files shared by multiple websites. These are not huge image files, only a few static files like PNG sprites and common JS files.
I'm kind of lost on the choice:
- Should I keep them in my web package to release in Azure?
- Or should I put them in blobs?
The thing I don't know is: if I get a lot of hits on the blob solution, might it cost more than the hits at the IIS level of the package?
Right, wrong?
Edit: I realize storing JS files in a blob won't deliver them gzipped?
No need for the blobs that I can see. The extra round trip to storage isn't adding value. I'd just put the static content on the web server and let it serve it up. Let the web server handle compressing the bytes on the wire for those cases where the client indicates that it can handle GZIP compression.
Will your JS and image files be modified often? If so, putting them into the service package would mean that every time you want to update those files, you have to recompile the service package and redeploy your instance. If you find yourself needing to update them often, this will become cumbersome. From a speed perspective, you're not going to see much of a difference between serving the files from blobs and serving them from the web role (assuming the files are in fact not huge). Last but not least, from a cost perspective, if you look at the price of blob storage ($0.15 per GB stored per month, $0.01 per 10,000 storage transactions), it's really not much. Your site would have to have a lot of traffic for the cost to become significant.
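On the gzip point raised in the question's edit: Blob Storage serves blobs as-is, so one common workaround (a sketch only, assuming the current azure-storage-blob v12 SDK; account, container, and file names are placeholders) is to upload a pre-compressed copy and set the Content-Encoding property:

```python
# Sketch: upload a pre-gzipped JS file with Content-Encoding set so browsers
# decompress it automatically. Assumes azure-storage-blob v12; names are placeholders.
import gzip

from azure.storage.blob import BlobClient, ContentSettings

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="static",
    blob_name="common.js",
)

with open("common.js", "rb") as f:
    compressed = gzip.compress(f.read())

blob.upload_blob(
    compressed,
    overwrite=True,
    content_settings=ContentSettings(
        content_type="application/javascript",
        content_encoding="gzip",
        cache_control="public, max-age=86400",
    ),
)
```

The caveat is that clients which do not accept gzip will still receive the compressed bytes, which is part of why the answers above lean toward letting the web server handle compression.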
