I want to download/upload files to Azure in parallel.
AzCopy, by default does not allow multiple runs on the same copy because of the locks on journal files. I am running multiple Azcopy instances on the same machine by pointing each of these instances to different journal files (using /Z )
But what is the bottle-neck in doing this? Bandwidth is obvious, but what is the bottleneck from Azure's side.
There is no real bottleneck from Azure's sides. Keep in mind though that the transfers are done with spare bandwidth and that there is no SLA as to whether it'll be fast or slow. That's all.
The other bottleneck you may need to check is your local CPU, when running over 4 AzCopy instances in parallel with 4 parallel uploads each my i7 starts to sweat a bit.
Related
I am trying to copy millions of small csv files from my storage account to a physical machine using the azcopy command, and I noticed that the speed has been very slow.
The format of azcopy command is
Azcopy Copy <Storage_account_source> --recursive --overwrite=True
And the Command is ran from the Physical Machine.
Is there a way where you can make azcopy download multiple blobs concurrently? instead of checking the blob one by one? I believe that's why the speed is dropping to such a low value of 1 mb/second as it's doing checks on these really small blobs one by one. Or if there is another way to increase the speed of this case of blob transfer?
azcopy is highly optimized for throughput using parallel processing etc. I haven't come across any tools that provide you faster download speed overall. The main limiting factors in my experience are usually (obviously) network bandwidth but also CPU power. It uses a lot of compute resources. So can you maybe increase those two on your VM at least for the duration of the download?
I am looking for some suggestions or resources on how to size-up servers for a spark cluster. We have enterprise requirements that force us to use on-prem servers only so I can't try the task on a public cloud (and even if I used fake data for PoC I would still need to buy the physical hardware later). The org also doesn't have a shared distributed compute env that I could use/I wasn't able to get good internal guidance on what to buy. I'd like to have some idea on what we need before I talk to a vendor who would try to up-sell me.
Our workload
We currently have a data preparation task that is very parallel. We implement it in python/pandas/sklearn + multiprocessing package on a set of servers with 40 skylake cores/80 threads and ~500GB RAM. We're able to complete the task in about 5 days by manually running this task over 3 servers (each one working on a separate part of the dataset). The tasks are CPU bounded (100% utilization on all threads) and usually the memory usage is low-ish (in the 100-200 GB range). Everything is scalable to a few thousand parallel processes, and some subtasks are even more paralellizable. A single chunk of data is in 10-60GB range (different keys can have very different sizes, a single chunk of data has multiple things that can be done to it in parallel). All of this parallelism is currently very manual and clearly should be done using a real distributed approach. Ideally we would like to complete this task in under 12 hours.
Potential of using existing servers
The servers we use for this processing workload are often used on individual basis. They each have dual V100 and do (single node, multigpu) GPU accelerated training for a big portion of their workload. They are operated bare metal/no vm. We don't want to lose this ability to use the servers on individual basis.
Looking for typical spark requirements they also have the issue of (1) only 1GB ethernet connection/switches between them (2) their SSDs are configured into a giant 11TB RAID 10 and we probably don't want to change how the file system looks like when the servers are used on individual basis.
Is there a software solution that could transform our servers into a cluster and back on demand or do we need to reformat everything into some underlying hadoop cluster (or something else)?
Potential of buying new servers
With the target of completing the workload in 12 hours, how do we go about selecting the correct number of nodes/node size?
For compute nodes
How do we choose number of nodes
CPU/RAM/storage?
Networking between nodes (our DC provides 1GB switches but we can buy custom)?
Other considerations?
For storage nodes
Are they the same as compute nodes?
If not how do we choose what is appropriate (our raw dataset is actually small, <1TB)
We extensively utilize a NAS as a shared storage between the servers, are there special consideration on how this needs to work with a cluster?
I'd also like to understand how I can scale up/down these numbers while still being able to viably complete the parallel processing workload. This way I can get a range of quotes => generate a budget proposal for 2021 => buy servers ~Q1.
Situation:
A user with a TB worth of files on our Azure blob storage and gigabytes of storage in our Azure databases decides to leave our services. At this point, we need to export all his data into 2GB packages and deposit them on the blob storage for a short period (two weeks or so).
This should happen very rarely, and we're trying to cut costs. Where would it be optimal to implement a task that over the course of a day or two downloads the corresponding user's blobs (240 KB files) and zips them into the packages?
I've looked at a separate webapp running a dedicated continuous webjob, but webjobs seem to shut down when the app unloads, and I need this to hibernate and not use resources when not up and running, so "Always on" is out. Plus, I can't seem to find a complete tutorial on how to implement the interface, so that I may cancel the running task and such.
Our last resort is abandoning webapps (three of them) and running it all on a virtual machine, but this comes up to greater costs. Is there a method I've missed that could get the job done?
This sounds like a job for a serverless model on Azure Functions to me. You get the compute scale you need without paying for idle resources.
I don't believe that there are any time limits on running the function (unlike AWS Lambda), but even so you'll probably want to implement something to split the job up first so it can be processed in parallel (and to provide some resilience to failures). Queue these tasks up and trigger the function off the queue.
It's worth noting that they're still in 'preview' at the moment though.
Edit - have just noticed your comment on file size... that might be a problem, but in theory you should be able to use local storage rather than doing it all in memory.
I have an executable that performs long calculations and I want to run those calculations on Azure. What would be the optimal service - batch or VM perhaps?
Azure batch or VM scale sets. Azure Batch is based on top of scale sets and is more specifically designed for task/jobs while VM scalesets help for scaling generic VMs.
Use cases for Batch:
Batch is a managed Azure service that is used for batch processing or batch computing--running a large volume of similar tasks to get some desired result. Batch computing is most commonly used by organizations that regularly process, transform, and analyze large volumes of data.
Batch works well with intrinsically parallel (also known as "embarrassingly parallel") applications and workloads. Intrinsically parallel workloads are easily split into multiple tasks that perform work simultaneously on many computers.
More info here for batch: https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview/
if you can change the doctype to multi-part and you're able to suspend your long job every minute or so and update progress, that will make it more user interactive and stops the http connection timing out. you could also add a cancel job button? or is the question about something else?
I am facing a problem of slow execution of exe in Azure platform
Following are the Steps:
Read data from SQL Azure Server& CSV files & display in on HTML5 pages.
Write data on CSV files.
Executing a external Fortron exe, which reads data from csv files generated in step 23.
Fortron exe after calculations write data on .txt file.
Read text file data generated in step 5 & display it on HTML5 pages.
Issue:
In point # 3, when we are invoking fortron exe using process start method, then –
On local machines in usually take 17~18 secs
On cloud server this is taking 34~35 secs.
Rest all other activities are taking same time on local as well as cloud server.
Regarding step 3: What size local machine are you using (e.g. number of cores), since you're running an exe that may be doing some number-crunching. Now compare that the the machine size allocated in Windows Azure? Are you using an Extra Small (shared core) or Small (single core)? Plus what size cpu does your local machine have? If you're not comparing like-kind configurations, you'll certainly have performance differences. Same goes for RAM (an Extra Small offers 768MB, with Small through XL offering 1.75GB per core) and bandwidth (XS has 5Mbps, Small through XL have 100Mbps per core).
The Azure systems have slower IO process than a local server this will be the reason you see the performance impact you are also on a shared system so your IO also may vary depending on your neighbours also and server load. If you are task is IO intensive the best bet is to run a VM and you need to persist the data is to attach multiple disks to the VM and then use stripping on the disk.
http://www.windowsazure.com/en-us/manage/windows/how-to-guides/attach-a-disk/
Striped IO Disks performance stats.
http://blinditandnetworkadmin.blogspot.co.uk/2012/08/vm-io-performance-on-windows-azure.html
You will need to have a warm set of disks to get true performance set.
Also I found the Temp storage on VM Normally the D drive to have very good IO so maybe worth if you are going to use a VM to try there first.