I'm wondering what the best practices are for mounting and unmounting in Databricks using DBFS.
More Details:
We are using Azure Data Lake Storage. We have multiple notebooks, and in each notebook we have code that calls mount, processes files, and then unmounts at the end (using code similar to https://forums.databricks.com/questions/8103/graceful-dbutils-mountunmount.html). But it looks like mount points are shared by all notebooks. So I'm wondering what would happen if two notebooks began running around the same time and one ran faster than the other. Could we get into a situation where the first notebook ends up unmounting the DBFS mount while the second notebook is still in the midst of its processing?
So should one mount within a notebook, or should this be done in some sort of initialization routine that all notebooks call? Similarly, should one try to unmount within a notebook, or should we just not bother with unmounts?
Are there any best practices I should be following?
Note: I am a newbie to Databricks and Python.
Mounting is usually done once per storage account/container/etc. It makes no sense to repeat it again and again, and re-mounting while somebody is working with the data may lead to data loss - we have seen that in practice. So it's better to mount everything at once, when creating the workspace or when adding a new storage account/container, and not remount it. From an automation standpoint, I would recommend using the corresponding resources in the Databricks Terraform provider.
But the main problem with mounts is that anyone in the workspace can use them, and data access happens under the permissions of whoever mounted them (for example, a service principal). Because of this, mounts are very bad from a security point of view (unless you use so-called credential passthrough mounts, available on Azure).
Mount points will interfere with each other if multiple notebooks run at the same time and access the same set of mount points. It is better to have one notebook initialize all the required mount points in one place, and call this notebook from all the other notebooks. It is possible to invoke a notebook inside another notebook in Databricks.
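The shared initialization notebook could use an idempotent mount helper, so a notebook that starts while another is running never re-mounts a path that is already in use. A minimal sketch, with the decision logic kept pure for clarity; the mount point, source URL, and `do_mount` callback are hypothetical stand-ins for `dbutils.fs.mounts()` / `dbutils.fs.mount()`:

```python
# Idempotent mount helper sketch: mount only if not already mounted.
# `existing_mounts` stands in for the mount points returned by
# dbutils.fs.mounts(); `do_mount` stands in for dbutils.fs.mount().

def ensure_mounted(mount_point, source, existing_mounts, do_mount):
    """Mount `source` at `mount_point` only if it is not already present."""
    if mount_point in existing_mounts:
        return False  # already mounted; nothing to do
    do_mount(source, mount_point)
    return True

# On Databricks this would be called roughly as (names hypothetical):
#   mounted = [m.mountPoint for m in dbutils.fs.mounts()]
#   ensure_mounted("/mnt/datalake",
#                  "abfss://container@account.dfs.core.windows.net/",
#                  mounted,
#                  lambda src, mp: dbutils.fs.mount(source=src,
#                                                   mount_point=mp,
#                                                   extra_configs=configs))
```

Pairing this with no unmount step at all (mount once, leave mounted) avoids the race described in the question.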
I had been using Azure ML studio for a while now and it was really fast but now when I try to unzip folders containing images around 3000 images using
!unzip "file.zip" -d "to unzip directory"
it took more than 30 minutes, and other activities (longer concatenation operations) also seem to take a long time, even using numpy arrays. I'm wondering if it is a configuration issue or some other problem. I have tried switching locations, creating new resource groups and workspaces, and changing computes (both CPU and GPU).
The compute and other current configuration settings can be seen in the image.
When you are using a notebook, your local directory is persisted on a (remote) Blob Store. Consequently, you are limited by network delays and, more significantly, by the IOPS your compute agent has.
What has worked for me is to use the local disk mounted on the compute agent. NOTE: this disk is not persisted, and everything on it will disappear when the compute agent is stopped.
After doing all your work, you can move the data to your persistent storage (which should be in your list of mounts). This might still be slow but you don't have to wait for it to complete.
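This work-locally-then-copy pattern can be sketched as follows. The paths and the `work` callback are hypothetical; the sketch assumes a temp directory lands on the compute agent's local disk, which is the usual case:

```python
import pathlib
import shutil
import tempfile

# Sketch: do the I/O-heavy work on the compute agent's local disk,
# then copy results to persistent (mounted) storage once at the end.

def process_on_local_disk(persistent_dir, work):
    scratch = pathlib.Path(tempfile.mkdtemp())  # fast local scratch space
    result_file = work(scratch)                 # heavy I/O stays on local disk
    dest = pathlib.Path(persistent_dir) / result_file.name
    shutil.copy2(result_file, dest)             # one slow copy at the end
    return str(dest)
```

The final copy to the mounted store may still be slow, but it happens once instead of on every read/write during processing.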
I have a three-container group running on Azure, plus an additional run-once container which I'm aiming to run once a day to update data on a file mount (which the server containers look at). I'm then looking to restart the containers in the group once the data has updated. Is there an easy way to achieve this with the Azure stack?
Tasks seem like the right kind of thing; however, I seem to be able to mount only secrets rather than standard volumes, which makes them unable to do what's required. Is there another solution I'm missing?
Thanks!
I am looking for the best solution for a problem where, let's say, an application has to access a CSV file (say, employee.csv) and perform some operations such as getEmployee or updateEmployee.
Which volume type is best suited for this, and why?
Please note that employee.csv will have some pre-loaded data already.
Also, to be precise, we are using azure-cli for handling Kubernetes.
Please Help!!
My first question would be: is your application meant to be scalable (i.e. have multiple instances running at the same time)? If that is the case, then you should choose a volume that can be written to by multiple instances at the same time (ReadWriteMany, https://kubernetes.io/docs/concepts/storage/persistent-volumes/). As you are using Azure, the AzureFile volume could fit your case. However, I am concerned that there could be conflicts with multiple writers (and some data may be lost). My advice would be to use a database system instead, so you avoid this kind of situation.
If you only want to have one writer, then you could use pretty much any of them. However, if you use local volumes, you could have problems when a pod gets rescheduled on another host (it would not be able to retrieve the data). Given your requirements (a simple CSV file), the reason I would give for choosing one PersistentVolume provider over another is whichever is least painful to set up. In this sense, just as before, if you are using Azure you could simply use an AzureFile volume type, as it should be the most straightforward to configure in that cloud: https://learn.microsoft.com/en-us/azure/aks/azure-files
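As a rough sketch of what such a claim could look like, here is a hypothetical ReadWriteMany PersistentVolumeClaim for the AKS built-in `azurefile` storage class, built as a plain Python dict so it can be serialized to YAML/JSON and applied with `kubectl apply -f`. The claim name and size are placeholders:

```python
# Sketch of a ReadWriteMany PVC manifest backed by Azure Files,
# expressed as a dict (serialize with json/yaml before applying).

def azurefile_pvc(name, size_gi):
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],  # multiple pods may mount it
            "storageClassName": "azurefile",   # AKS built-in storage class
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }
```

ReadWriteMany is the key property here: an `azurefile`-backed claim can be mounted by several replicas at once, whereas Azure Disk volumes are ReadWriteOnce.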
My lab just got a sponsorship from Microsoft Azure and I'm exploring how to utilize it. I'm new to industrial level cloud service and pretty confused about tons of terminologies and concepts. In short, here is my scenario:
I want to experiment with the same algorithm on multiple datasets, aka data parallelism.
The algorithm is implemented in C++ on Linux (Ubuntu 16.04). I did my best to use static linking, but it still depends on some dynamic libraries. However, these dynamic libraries can easily be installed via apt.
Each dataset is structured, meaning data (images, other files...) are organized in folders.
The ideal system configuration would be a bunch of identical VMs and a shared file system. Then I could submit my jobs with 'qsub' from a script or something. Is there a way to do this on Azure?
I investigated the Batch service, but I had trouble installing dependencies after creating compute nodes. I also had trouble with storage. So far I have only seen examples of using Batch with Blob storage, which is unstructured.
So, are there any other services in Azure that can meet my requirements?
I somehow figured it out myself based on this article: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-classic-hpcpack-cluster/. Here is my solution:
Create an HPC Pack cluster with a Windows head node and a set of Linux compute nodes. There are several useful templates in the Marketplace.
From the head node, we can execute commands on the Linux compute nodes, either inside HPC Cluster Manager or using "clusrun" in PowerShell. We can easily install dependencies via apt-get on the compute nodes.
Create a File share inside one of the storage accounts. This can be mounted by all machines inside the cluster.
One glitch here is that, for some encryption reason, you cannot mount the File share on Linux machines outside Azure. There are two solutions in my head: (1) mount the File share on the Windows head node and share files from there, either by FTP or SSH; (2) create another Linux VM (as a bridge), mount the File share on that VM, and use "scp" to communicate with it from outside. Since I'm not familiar with Windows, I adopted the latter solution.
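For reference, mounting an Azure File share on a Linux VM inside Azure is an SMB/CIFS mount. A sketch that builds the mount command as an argument list, with the storage account, share name, and key as placeholders; the assembled command would be run with sudo, e.g. via subprocess:

```python
# Sketch: build the CIFS mount command for an Azure File share.
# Account name, share name, and key are placeholders, not real values.

def cifs_mount_cmd(account, share, mount_point, key):
    options = ",".join([
        "vers=3.0",                 # SMB 3.0, required for encrypted access
        f"username={account}",      # storage account name
        f"password={key}",          # storage account key
        "dir_mode=0777",
        "file_mode=0777",
    ])
    return [
        "mount", "-t", "cifs",
        f"//{account}.file.core.windows.net/{share}",
        mount_point,
        "-o", options,
    ]
```

Running this on every machine in the cluster against the same share gives the shared file system the job scripts rely on.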
For the executable, I simply uploaded the binary compiled on my local machine. Most dependencies are statically linked, but there are still a few dynamic objects. I uploaded these dynamic objects to Azure and set LD_LIBRARY_PATH when executing programs on the compute nodes.
Job submission is done on the Windows head node. To make it more flexible, I wrote a Python script which writes XML files. The Job Manager can load these XML files to create jobs. Here are some instructions: https://msdn.microsoft.com/en-us/library/hh560266(v=vs.85).aspx
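Such a generator might look like the following sketch, using the standard library's xml.etree. The element and attribute names here are illustrative only, not the exact HPC Pack job schema; see the linked instructions for the real format:

```python
import xml.etree.ElementTree as ET

# Sketch: generate a job description as XML, one task per dataset
# (one command line per task). Element/attribute names are illustrative.

def job_xml(job_name, commands):
    job = ET.Element("Job", Name=job_name)
    for i, cmd in enumerate(commands):
        ET.SubElement(job, "Task", Name=f"task{i}", CommandLine=cmd)
    return ET.tostring(job, encoding="unicode")

# Example: one task per dataset folder on the shared file share.
# xml_text = job_xml("experiment", ["./run /share/ds1", "./run /share/ds2"])
```

Generating one task per dataset folder is what turns the shared file share plus identical compute nodes into the qsub-style data parallelism described above.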
I believe there should be a more elegant solution with the Azure Batch service, but so far my small cluster runs pretty well with HPC Pack. I hope this post can help somebody.
Azure Files could provide a shared file solution for your Ubuntu boxes - details are here:
https://azure.microsoft.com/en-us/documentation/articles/storage-how-to-use-files-linux/
Again, depending on your requirements, you can create a pseudo folder structure in Blob storage via the use of containers and "/" in the naming strategy of your blobs.
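Blob storage itself is flat; the pseudo hierarchy is purely a naming convention. A small sketch of how a folder-style listing falls out of "/"-delimited blob names (the blob names here are made up):

```python
# Sketch: list the top-level pseudo "folders" under a prefix, given a
# flat list of blob names that use "/" as a delimiter.

def list_pseudo_dirs(blob_names, prefix=""):
    dirs = set()
    for name in blob_names:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            if "/" in rest:
                # everything up to the first "/" acts as a directory
                dirs.add(prefix + rest.split("/", 1)[0] + "/")
    return sorted(dirs)
```

The real Blob APIs expose the same idea via a listing delimiter, so each dataset folder can be addressed by its name prefix.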
To David's point, whilst Batch is generally looked at for these kinds of workloads, it may not fit your solution. VM Scale Sets (https://azure.microsoft.com/en-us/documentation/articles/virtual-machine-scale-sets-overview/) would allow you to scale your compute capacity either by load or by schedule, depending on your workload's behavior.
For some scenarios, a clustered file system is just too much. This is, if I understood it correctly, the use case for the data volume container pattern. But even CoreOS needs updates from time to time. If I still want to minimise the downtime of applications, I'd have to move the data volume container along with the app container to another host while the old host is being updated.
Are there any existing best practices? One solution mentioned often is "backing up" a container with docker export on the old host and docker import on the new host. But this would involve scp-ing tar files to another host. Can this be managed with fleet?
#brejoc, I wouldn't call this a solution, but it may help:
Alternative 1: Use another OS which does have clustering, or at least doesn't prevent it. I am now experimenting with CentOS.
Alternative 2: I've created a couple of tools that help in some use cases. The first tool retrieves data from S3 (usually artifacts) and is uni-directional. The second tool, which I call a 'backup volume container', has a lot of potential in it but requires some feedback. It provides two-way backup/restore for data, from/to many persistent data stores including S3 (but also Dropbox, which is cool). As implemented now, when you run it for the first time, it restores to the container. From that point on, it monitors the relevant folder in the container for changes, and upon changes (and after a quiet period), it backs up to the persistent store.
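The "quiet period" behaviour described above is essentially a debounce: a backup fires only after the folder has been unchanged for some interval. A minimal sketch of that logic, with timestamps passed in explicitly so it is easy to follow; a real watcher would feed it file-change event times:

```python
# Sketch of quiet-period (debounce) logic for a backup trigger.

class QuietPeriodTrigger:
    """Fire a backup only after `quiet` seconds with no new changes."""

    def __init__(self, quiet):
        self.quiet = quiet
        self.last_change = None

    def on_change(self, t):
        # Any change resets the quiet-period timer.
        self.last_change = t

    def due(self, now):
        # Back up once the folder has been quiet long enough.
        return self.last_change is not None and now - self.last_change >= self.quiet
```

This keeps a burst of writes from producing a backup per write; only the settled state gets pushed to the persistent store.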
Backup volume container: https://registry.hub.docker.com/u/yaronr/backup-volume-container/
File sync from S3: https://registry.hub.docker.com/u/yaronr/awscli/
(docker run yaronr/awscli aws s3 etc etc - read aws docs)