I had been using Azure ML studio for a while now and it was really fast but now when I try to unzip folders containing images around 3000 images using
!unzip "file.zip" -d "to unzip directory"
it took more than 30 minutes and other activities(longer concatenation methods) also seem to take a long time even using numpy arrays. Wondering if it is something with configuration or other problems. I have tried switching locations, creating new resource groups, workspaces, changing computes(Both CPU and GPU).
Compute and other set of current configurations can be seen on the image
When you are using a notebook, your local directory is persisted on a (remote) Blob Store. Consequently, you are limited by network delays and more significantly the IOPS your compute agent has.
What has worked for me is to use the local disk mounted on the compute agent. NOTE: This is not persisted and all the stuff on this will disappear when the compute agent is stopped.
After doing all your work, you can move the data to your persistent storage (which should be in your list of mounts). This might still be slow but you don't have to wait for it to complete.
Related
Posting here as server fault doesn't seem to have the detailed Azure knowledge.
I have a Azure storage account, a file share. This file share is connected to a Azure VM through mapped drive. A FTP server on the VM accepts a stream of files and stores them in the File Share directly.
There are no other connections. Only I have Azure admin access, limited support people have access to the VM.
Last week, for unknown reasons 16 million files, which are nested in many sub-folders (origin, date) moved instantly into a unrelated subfolder, 3 levels deep.
I'm baffled how this can happen. There is a clear instant cut off when files moved.
As a result, I'm seeing increased costs on LRS. I'm assuming because internally Azure storage is replicating the change at my expense.
I have attempted to copy the files back using a VM and AZCOPY. This process crashed midway through leaving me with a half a completed copy operation. This failed attempt took days, which makes me confident I wasn't the support guys dragging and moving a folder by accident.
Questions:
Is it possible to just instantly move so many files (how)
Is there a solid way I can move the files back, taking into account the half copied files - I mean an Azure backend operation way rather than writing an app / power shell / AZCOPY?
So there a cost efficient way of doing this (I'm on Transaction Optimised tier)
Do I have a case here to get Microsoft to do something, we didn't move them... I assume something internally messed up.
Thanks
A tool that supports server-side copy (like AzCopy) can move the files quickly because only the metadata is updated. If you wants to investigate the root cause, I recommend opening a support case. (To sort this out – Your best bet is to connect with our Azure support team by filing a ticket, our support team on best effort basis can help you guide on this matter. )
It's taking 1 hour to download a 48gb dataset with 90000 files.
I am training an image segmentation model on Azure ML pipeline using compute target p100-nc6s-v2.
In my script I'm accessing Azure Blob Storage using DataReference's as_download() functionality. The blob storage is in the same location as workspace (using get_default_datastore).
Note: I'm able to download complete dataset to local workstation within a few minutes using az copy.
When I tried to use as_mount() the first epoch was extremely slow (4700 seconds vs 772 seconds for subsequent epochs).
Is this expected behavior? If not, what can be done to improve dataset loading speed?
Working folder of the run is mounted cloud storage, which could be defaulting to file storage in your workspace.
Can you try setting blob datastore instead, and see if perf improves?
run_config.source_directory_data_store = 'workspaceblobstore'
as_download() downloads the data to the current working directory, which is a mouted file-share (or blob if you do what #reznikov suggested).
Unfortunately, for small files, neither blob nor file-share are very performant (although blob is much better) -- see this reply for some measurements: Disk I/O extremely slow on P100-NC6s-V2
When you are mounting, the reason that the first epoch is so slow, lies in the fact that blob fuse (which is used for mounting blobs) caches to the local SSD, so for after the first epoch, everything is on your SSD and you get the full performance.
As for why the first epoch takes much longer than the az copy, I would suspect that the data reader of the framework you are using does not pipeline the reads. What are you using?
You could try one of 2 things:
Mount, but at the beginning of the job, copy the data from the mount path to /tmp and consume it from there.
If #1 is significantly slower than az copy, then, don't mount. Instead, at the beginning of the job, copy the data to /tmp using az copy.
I'm trying to download data of azure blob storage container to my machine. It contains of multiple small files, 12-60 KB each. When I use Microsoft Azure Storage Explorer app, it downloads no more than few hundreds of items at once and then halts for tens of minutes before trying to download a next batch.
This makes speed of download roughly less than 3 KB/s, which is quite horrible.
I've also tried using open source npm package to download container files, with similar results.
Is there a way to decrease latency/increase speeds? Or is there a better way do download all container data?
Actually it depend on your (local) machine's network speed. So you can try create a small instance in same region, and download to that instance. It will much faster than your local machine . Then try to archive total files and ftp back to your local machine.
We're using Managed VMs, and can currently serve files from the local disk in the VM (which is a standard magnetic HD), as well as serving from Google Cloud Storage (which is also backed by magnetic HDs).
https://cloud.google.com/appengine/docs/managed-vms/
As we're working with large files (high-res geo images) in a latency-sensitive context, we'd like to be able to use Local SSDs with our Managed VMs app (it's okay that the data is not persistent, it just needs to be fast and work with large files). At some point, we may want to use other services that are fast and designed for working with large files (e.g. Blobstore?), but we have a workflow already set up to work with files so it should be easiest to set up a faster file system now. Is it possible to use Local SSD storage with Managed VMs?
Here's info on Local SSDs. They need to be created at instance creation time (for Google Compute instances, which Managed VMs are creating behind the scenes). It looks like Local SSDs can be created via command line, gcloud compute, or an API, but it's not clear where we'd configure any of these things since Managed VMs is doing the instance creation for us. Presumably we'd do this in app.yaml, Dockerfile, or in a gcloud command, but it's not obvious how this would work.
https://cloud.google.com/compute/docs/disks/local-ssd
Apologies, this isn't currently available. We don't really let you customize the VM specs, aside from the disk size. If you max out the disk size, you will get the highest IOPS we support. We kicked around the idea of tmpfs, but it's not available just yet.
I know that we can use the VM Depot to get started with the Neo4J in Azur but one thing that is not clear is where should we physically store the DB files. I tried to look around in the net if there are any recommendations on where the physical files would be stored so that then a VM crashes or restarts, the data is not lost.
can someone share their thoughts or point me to a address where some more details can be found on do and don'ts of Neo4j on Azure for a production environment.
Regards
Kiran
When you set up a Neo4j VM via VM Depot, that image, by default, configures the database files to reside within the same VM as the server itself. The location is specified in neo4j-server.properties. This lets you simply spin up the VM and start using Neo4j immediately.
However: You'll soon discover that your storage space is limited (I believe the VM instances are set up with a 127GB disk). To work with larger databases, you'll need to attach an additional disk (or disks), each disk up to 1TB in size. These disks, as well as the main VM disk, are backed by blob storage, meaning they're durable - persistent disks.
How you ultimately configure this is up to you, depending on the size of the database and its purpose. The only storage to avoid, if you need persistence, is the scratch disk provided (which is a locally-attached drive with no durability).
The documentation announcing that VM doesn't say. But when you install neo4j as a package on to other similar linux systems (the VM in question is a linux VM) then the data usually goes into /var/lib/neo4j/data. Here's an example:
user#host:/var/lib/neo4j/data$ pwd
/var/lib/neo4j/data
user#host:/var/lib/neo4j/data$ ls
graph.db keystore log neo4j-service.pid README.txt rrd
user#host:/var/lib/neo4j/data$ cat README.txt
Neo4j Data
=======================================
This directory contains all live data managed by this server, including
database files, logs, and other "live" files.
The main directory you really have to have is the "graph.db" directory. That's going to contain the bulk of the data. May as well back up the entirety of this directory. Some of the files (like the .pid file and the README.txt) of course aren't needed.
Now, there's no guarantee that in the VM that it's going to be /var/lib/neo4j/data but it's going to be something very similar. And what you're going to want is going to be a directory whose name ends in .db since that's the default for new neo4j databases.
To narrow down further, once you get that VM running, just run updatedb then locate *.db | grep neo4j and that's almost certain to find it quickly.