Is it possible to use Local SSD Storage with Google Managed VMs? - managed-vm

We're using Managed VMs, and can currently serve files from the local disk in the VM (which is a standard magnetic HD), as well as serving from Google Cloud Storage (which is also backed by magnetic HDs).
https://cloud.google.com/appengine/docs/managed-vms/
As we're working with large files (high-res geo images) in a latency-sensitive context, we'd like to be able to use Local SSDs with our Managed VMs app (it's okay that the data is not persistent, it just needs to be fast and work with large files). At some point, we may want to use other services that are fast and designed for working with large files (e.g. Blobstore?), but we have a workflow already set up to work with files so it should be easiest to set up a faster file system now. Is it possible to use Local SSD storage with Managed VMs?
Here's info on Local SSDs. They need to be created at instance creation time (for Google Compute instances, which Managed VMs are creating behind the scenes). It looks like Local SSDs can be created via command line, gcloud compute, or an API, but it's not clear where we'd configure any of these things since Managed VMs is doing the instance creation for us. Presumably we'd do this in app.yaml, Dockerfile, or in a gcloud command, but it's not obvious how this would work.
https://cloud.google.com/compute/docs/disks/local-ssd

Apologies, this isn't currently available. We don't really let you customize the VM specs, aside from the disk size. If you max out the disk size, you will get the highest IOPS we support. We kicked around the idea of tmpfs, but it's not available just yet.

Related

Azure ML studio really slow

I had been using Azure ML studio for a while now and it was really fast but now when I try to unzip folders containing images around 3000 images using
!unzip "file.zip" -d "to unzip directory"
it took more than 30 minutes and other activities(longer concatenation methods) also seem to take a long time even using numpy arrays. Wondering if it is something with configuration or other problems. I have tried switching locations, creating new resource groups, workspaces, changing computes(Both CPU and GPU).
Compute and other set of current configurations can be seen on the image
When you are using a notebook, your local directory is persisted on a (remote) Blob Store. Consequently, you are limited by network delays and more significantly the IOPS your compute agent has.
What has worked for me is to use the local disk mounted on the compute agent. NOTE: This is not persisted and all the stuff on this will disappear when the compute agent is stopped.
After doing all your work, you can move the data to your persistent storage (which should be in your list of mounts). This might still be slow but you don't have to wait for it to complete.

Azure VM Software Installation RDP

I am new to the cloud and have a very basic questions that I am having a hard time understanding.
I have created an Azure Virtual machine and now I am installing third party software using RDP. Example: BareTail, NotePad++, a Trading Software(TWS), the goal is to replace my own Desktop/PC with the one in cloud, to help me when I am travelling.
Question: How often will i have to re-install thee s/w ? Or is it a one and done ? I am hoping only one time, but not sure.
Thank You.
In Azure, by default, if you do not attach any data disk you will have a persistent system disk and temporary swap disk.
Just do not install / do not put any data on a temporary disk and all your data will persist until you pay for your subscription or you will remove your VM with OS Disk.
Even if you remove Virtual Machine resource, Azure will not remove you OS disk, so you will able to find your data but you need to use command line tools to create a Virtual Machine from existing OS disk to recover your data, so be careful. You can use Azure Locks to protect your resources from deletion.
If you want to protect data on your disks from corruption you have also Azure Backup.
Start / Stop operations do not have any impact on your data and software if it was not placed on a temporary disk.

Use Microsoft Azure as a computing cluster

My lab just got a sponsorship from Microsoft Azure and I'm exploring how to utilize it. I'm new to industrial level cloud service and pretty confused about tons of terminologies and concepts. In short, here is my scenario:
I want to experiment the same algorithm with multiple datasets, aka data parallelism.
The algorithm is implemented with C++ on Linux (ubuntu 16.04). I made my best to use static linking, but still depends on some dynamic libraries. However these dynamic libraries can be easily installed by apt.
Each dataset is structured, means data (images, other files...) are organized with folders.
The idea system configuration would be a bunch of identical VMs and a shared file system. Then I can submit my job with 'qsub' from a script or something. Is there a way to do this on Azure?
I investigated the Batch Service, but having trouble installing dependencies after creating compute node. I also had trouble with storage. So far I only saw examples of using Batch with Blob storage, with is unstructured.
So are there any other services in Azure can meet my requirement?
I somehow figured it out my self based on the article: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-classic-hpcpack-cluster/. Here is my solution:
Create a HPC Pack with a Windows head node and a set of Linux compute node. Here are several useful template in Marketplace.
From Head node, we can execute command inside Linux compute node, either inside HPC Cluster Manager, or using "clusrun" inside PowerShell. We can easily install dependencies via apt-get for computing node.
Create a File share inside one of the storage account. This can be mounted by all machines inside the cluster.
One glitch here is that for some encryption reason, you can not mount the File share on Linux machines outside the Azure. There are two solutions in my head: (1) mount the file share to Windows head node, and create file sharing from there, either by FTP or SSH. (2) create another Linux VM (as a bridge), mount the File share on that VM and use "scp" to communicate with it from outside. Since I'm not familiar with Windows, I adopted the later solution.
For executable, I simply uploaded the binary executable compiled on my local machine. Most dependencies are statically linked. There are still a few dynamic objects, though. I upload these dynamic object to the Azure and set LD_LIBRARY_PATH when execute programs on computing node.
Job submission is done in Windows head node. To make it more flexible, I wrote a python script, which writes XML files. The Job Manager can load these XML files to create a job. Here are some instructions: https://msdn.microsoft.com/en-us/library/hh560266(v=vs.85).aspx
I believe there should be more a elegant solution with Azure Batch Service, but so far my small cluster runs pretty well with HPC Pack. Hope this post can help somebody.
Azure files could provide you a shared file solution for your Ubuntu boxes - details are here:
https://azure.microsoft.com/en-us/documentation/articles/storage-how-to-use-files-linux/
Again depending on your requirement you can create a pseudo structure via Blob storage via the use of containers and then the "/" in the naming strategy of your blobs.
To David's point, whilst Batch is generally looked at for these kind of workloads it may not fit your solution. VM Scale Sets(https://azure.microsoft.com/en-us/documentation/articles/virtual-machine-scale-sets-overview/) would allow you to scale up your compute capacity either via load or schedule depending on your workload behavior.

How to share Data disk among many virtual machines running Linux in Azure?

I have my virtual machine running Linux. I've created it via new "Resource manager". Then I added data disk to it.
Then I created new Virtual Machine. And I want it to use the same data disk attached to the first one (at least in read-only mode).
When I try to "attach existing disk" to this new machine I get this error:
Failed to attach existing disk 'DISK-NAME.vhd' to the virtual machine 'MACHINE-NAME'. Error: Failed to acquire lease while creating disk 'DISK-NAME.vhd' using blob with URI https://BLOB-URI-disk1.vhd. Blob is already in use.
How do I attach existing data disk which is in use by another machine to my current machine?
Simply, you can't.
A disk in Azure can only be attached to a single VM at a time. in order to attach it to another VM you need to disconnect it from the first.
If you need to have data shared amongst many machines, you could use Azure File shares which provides SMB 2.1 and SMB 3.0. Most modern Linux versions can connect to this quite seamlessly.
If you need block storage, i.e. sharing an actual disk, you would need to spin up a separate VM and use a protocol like iscsi (or NFS) to share that disk amongst multiple machines.
Maybe the "StorSimple" Solution from Microsoft Azure could be a way to go? I would describe it as SAN on Azure.
I have not tested it today, but it should be possible to connect several virtual machines to it, and share the files.
You can find more information in the documentation:
Azure StorSimple
Extending Michael's answer a bit, based on your comments under his answer:
If the goal is to provide data access when a VM goes down for some reason, and the data is on an attached disk, then you can detach the disk from the downed VM and reattach it to another VM (the detach breaks the lease, and the reattach creates a new lease). But be aware: This is a time-consuming process - it might take a minute or two for each operation. But you can certainly do it, and you can do it programmatically.
Regarding disk replication: Yes, Azure disks are triple-replicated (or replicated 6 times, if you enable georeplication). But logically, it's a single disk; it's replicated for durability, not for you to attach to different replicas.
Michael mentioned Azure File Service. Maybe it wasn't clear what that was but... there's no Virtual Machine involved with File Service - it's a durable-storage SMB service, with its own SLA unrelated to your VMs. You may attach to it from multiple VMs and read/write files as you would a locally-attached disk (which seems to be the problem you're trying to tackle).
Regarding replication of data across VM's: If you choose to go this route, and make physical copies yourself, it's strictly up to you how you do it - there is no "best way." But this is the type of thing database engines are built for (and you can imagine how complex they are, dealing with replication, journaling, errors, etc.).

Neo4j Azure hosting and Database location

I know that we can use the VM Depot to get started with the Neo4J in Azur but one thing that is not clear is where should we physically store the DB files. I tried to look around in the net if there are any recommendations on where the physical files would be stored so that then a VM crashes or restarts, the data is not lost.
can someone share their thoughts or point me to a address where some more details can be found on do and don'ts of Neo4j on Azure for a production environment.
Regards
Kiran
When you set up a Neo4j VM via VM Depot, that image, by default, configures the database files to reside within the same VM as the server itself. The location is specified in neo4j-server.properties. This lets you simply spin up the VM and start using Neo4j immediately.
However: You'll soon discover that your storage space is limited (I believe the VM instances are set up with a 127GB disk). To work with larger databases, you'll need to attach an additional disk (or disks), each disk up to 1TB in size. These disks, as well as the main VM disk, are backed by blob storage, meaning they're durable - persistent disks.
How you ultimately configure this is up to you, depending on the size of the database and its purpose. The only storage to avoid, if you need persistence, is the scratch disk provided (which is a locally-attached drive with no durability).
The documentation announcing that VM doesn't say. But when you install neo4j as a package on to other similar linux systems (the VM in question is a linux VM) then the data usually goes into /var/lib/neo4j/data. Here's an example:
user#host:/var/lib/neo4j/data$ pwd
/var/lib/neo4j/data
user#host:/var/lib/neo4j/data$ ls
graph.db keystore log neo4j-service.pid README.txt rrd
user#host:/var/lib/neo4j/data$ cat README.txt
Neo4j Data
=======================================
This directory contains all live data managed by this server, including
database files, logs, and other "live" files.
The main directory you really have to have is the "graph.db" directory. That's going to contain the bulk of the data. May as well back up the entirety of this directory. Some of the files (like the .pid file and the README.txt) of course aren't needed.
Now, there's no guarantee that in the VM that it's going to be /var/lib/neo4j/data but it's going to be something very similar. And what you're going to want is going to be a directory whose name ends in .db since that's the default for new neo4j databases.
To narrow down further, once you get that VM running, just run updatedb then locate *.db | grep neo4j and that's almost certain to find it quickly.

Resources