Use a local IDE such as Microsoft Visual Studio Code to use additional compute - python-3.x

I am trying to figure out the best way to use a local IDE such as Microsoft Visual Studio Code with distributed computing power. Currently we are bringing data down locally, but that doesn't seem like a sustainable solution, for reasons such as the future growth of the data and cloud data security. One workaround we thought of is to tunnel into EC2 instances, but I would like to hear what the best way is to solve this in a machine learning/data science environment (we are using Databricks and AWS services).

Not sure why you are connecting the IDE directly to the compute. I have used VS Code for running scripts against an HDInsight cluster. Before I fire off my scripts, I configure the cluster against which they are going to run. The same is true for Databricks.
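For a Databricks cluster, one way that "configure first, then run" flow can look is with the databricks-connect client (an assumption here, since the answer doesn't name a specific tool): the connection details live in a config file created by `databricks-connect configure`, and the script itself only asks for a SparkSession.

```python
# Minimal sketch, assuming databricks-connect: the cluster host, token, and
# cluster ID are stored in ~/.databricks-connect (written by
# `databricks-connect configure`), so the script stays cluster-agnostic.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # resolves to the configured remote cluster

# Any DataFrame work now executes on the Databricks cluster, not on the laptop.
df = spark.range(1_000_000)
print(df.count())
```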

Related

Switching between Databricks Connect and local Spark environment

I am looking to use Databricks Connect for developing a pyspark pipeline. DBConnect is really awesome because I am able to run my code on the cluster where the actual data resides, so it's perfect for integration testing, but I also want to be able to simply use a local Spark environment during development and unit testing (pytest with pytest-spark).
Is there any way to configure DBConnect so for one use-case I simply use a local Spark environment, but for another it uses DBConnect?
My 2 cents, since I've been doing this type of development for some months now:
Work with two Python environments: one with databricks-connect (and thus, no pyspark installed), and another one with only pyspark installed. When you want to execute the tests, just activate the "local" virtual environment and run pytest as usual. Make sure, as some commenters pointed out, that you are initializing the pyspark session using SparkConf().setMaster("local").
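A minimal hand-rolled version of that local session might look like the fixture below (pytest-spark provides a similar spark_session fixture out of the box; the fixture name and app name here are just illustrative):

```python
# conftest.py in the "local" environment (plain pyspark, no databricks-connect)
import pytest
from pyspark import SparkConf
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    conf = SparkConf().setMaster("local[*]").setAppName("unit-tests")
    session = SparkSession.builder.config(conf=conf).getOrCreate()
    yield session
    session.stop()
```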
PyCharm helps immensely to switch between environments during development. I am always on the "local" venv by default, but whenever I want to execute something using databricks-connect, I just create a new Run configuration from the menu. Easy peasy.
Also, be aware of some of databricks-connect's limitations:
It is not officially supported anymore, and Databricks recommends moving towards dbx whenever possible.
UDFs just won't work in databricks-connect.
MLflow integration is not reliable. In my use case, I am able to download and use models, but I am unable to log a new experiment or track models using the Databricks tracking URI. This might depend on your Databricks Runtime, MLflow, and local Python versions.

Remote Execution platform

I’m looking for a framework/platform that would allow me to execute remote commands on a Windows machine and report back the results.
These machines would be public, outside our company network, probably behind firewalls, proxies, etc. We have complete access to them and can configure them in any way we want. Think ATMs with a 3G network.
I guess what I'm looking for is something like SaltStack remote execution, but the enterprise plan has a high cost per minion, and I would need to install it on thousands of machines.
Another possible solution would be something like Octopus Deploy, Azure DevOps, or any CD tool for that matter, but without the need for environments.
I've also looked at Ansible, but without an agent to overcome the targets being behind firewalls, routers, and proxies, I'm not sure how the reverse connection would work.
I would like to avoid Puppet or Chef for now. Ideally, a cloud-based solution would be wonderful, especially in Azure.
Any recommendations or directions?
Octopus Deploy is currently working on "Ops Processes" which sounds like it might fit what you are looking for. It's on our roadmap if you are interested, and we are planning to have the first round of features from this ready to ship in the next 8 weeks or so.
Caveat: I work at Octopus so read into that what you will

Use Microsoft Azure as a computing cluster

My lab just got a sponsorship from Microsoft Azure, and I'm exploring how to utilize it. I'm new to industrial-scale cloud services and pretty confused by the tons of terminology and concepts. In short, here is my scenario:
I want to experiment with the same algorithm on multiple datasets, aka data parallelism.
The algorithm is implemented in C++ on Linux (Ubuntu 16.04). I did my best to use static linking, but it still depends on some dynamic libraries. However, these dynamic libraries can easily be installed with apt.
Each dataset is structured, meaning the data (images, other files...) are organized in folders.
The ideal system configuration would be a bunch of identical VMs and a shared file system. Then I could submit my jobs with 'qsub' from a script or something. Is there a way to do this on Azure?
I investigated the Batch service, but had trouble installing dependencies after creating the compute nodes. I also had trouble with storage. So far I have only seen examples of using Batch with Blob storage, which is unstructured.
So are there any other services in Azure that can meet my requirements?
I somehow figured it out myself based on this article: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-classic-hpcpack-cluster/. Here is my solution:
Create an HPC Pack cluster with a Windows head node and a set of Linux compute nodes. There are several useful templates in the Marketplace.
From the head node, we can execute commands on the Linux compute nodes, either inside HPC Cluster Manager or using "clusrun" from PowerShell. We can easily install dependencies on the compute nodes via apt-get.
Create a File share inside one of the storage accounts. This can be mounted by all machines inside the cluster.
One glitch here is that, for some encryption reason, you cannot mount the File share on Linux machines outside Azure. There are two solutions in my head: (1) mount the file share on the Windows head node and share files from there, either by FTP or SSH; (2) create another Linux VM (as a bridge), mount the File share on that VM, and use "scp" to communicate with it from outside. Since I'm not familiar with Windows, I adopted the latter solution.
For the executable, I simply uploaded the binary compiled on my local machine. Most dependencies are statically linked, but there are still a few dynamic objects. I uploaded these dynamic objects to Azure and set LD_LIBRARY_PATH when executing programs on the compute nodes.
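For illustration, the launch step on a compute node might look something like this (the binary name and file-share paths are hypothetical, not taken from the original post):

```python
# Hypothetical launcher: run the uploaded binary with LD_LIBRARY_PATH pointing
# at the dynamic objects copied to the shared file share.
import os
import subprocess

env = dict(os.environ)
env["LD_LIBRARY_PATH"] = "/mnt/fileshare/libs:" + env.get("LD_LIBRARY_PATH", "")

subprocess.run(
    ["/mnt/fileshare/bin/my_algorithm", "--input", "/mnt/fileshare/datasets/set01"],
    env=env,
    check=True,
)
```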
Job submission is done on the Windows head node. To make it more flexible, I wrote a Python script which writes XML files. The Job Manager can load these XML files to create a job. Here are some instructions: https://msdn.microsoft.com/en-us/library/hh560266(v=vs.85).aspx
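A rough sketch of such a generator script is below: one task per dataset, emitted as a job description XML that Job Manager can import. The element and attribute names are simplified and illustrative; the actual schema is documented at the MSDN link above.

```python
# Illustrative generator: writes a simplified HPC Pack job description with
# one <Task> per dataset folder (dataset names and paths are hypothetical).
import xml.etree.ElementTree as ET

datasets = ["set01", "set02", "set03"]

job = ET.Element("Job", Name="image-experiments", MinNodes="1", MaxNodes="3")
tasks = ET.SubElement(job, "Tasks")
for ds in datasets:
    ET.SubElement(
        tasks,
        "Task",
        Name=f"run-{ds}",
        CommandLine=f"/mnt/fileshare/bin/my_algorithm --input /mnt/fileshare/datasets/{ds}",
    )

ET.ElementTree(job).write("job.xml", encoding="utf-8", xml_declaration=True)
```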
I believe there should be a more elegant solution with the Azure Batch service, but so far my small cluster runs pretty well with HPC Pack. I hope this post can help somebody.
Azure Files could provide you with a shared file solution for your Ubuntu boxes - details are here:
https://azure.microsoft.com/en-us/documentation/articles/storage-how-to-use-files-linux/
Again, depending on your requirements, you can create a pseudo folder structure in Blob storage through the use of containers and the "/" in the naming strategy of your blobs.
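For example, with the azure-storage-blob package (a v12-style API, which is an assumption; the SDK of the time differed), the "/" convention looks like this. The container name, connection string, and blob names are placeholders:

```python
# Sketch of the "/" naming convention for a pseudo folder structure in Blob storage.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("datasets")

# The folder-like prefix is just part of the blob name.
with open("img001.png", "rb") as f:
    container.upload_blob(name="set01/images/img001.png", data=f, overwrite=True)

# Listing by prefix returns only that "folder".
for blob in container.list_blobs(name_starts_with="set01/images/"):
    print(blob.name)
```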
To David's point, whilst Batch is generally looked at for these kinds of workloads, it may not fit your solution. VM Scale Sets (https://azure.microsoft.com/en-us/documentation/articles/virtual-machine-scale-sets-overview/) would allow you to scale your compute capacity either by load or by schedule, depending on your workload behavior.

How to setup local cluster for MBrace

I'm trying to follow tutorials on using MBrace with F# (one is here: a YouTube video). The problem is that all the videos I've seen are either using Azure or running some form of local cluster on the machine.
Since I won't be using Azure for now, how do I set up a local cluster which I can use to test MBrace locally without having to go online?
If you want to test MBrace with a local cluster on your machine, you can git clone https://github.com/mbraceproject/MBrace.Core; for a sample, check https://github.com/mbraceproject/MBrace.Core/blob/master/samples/wordcount.fsx
One important note is that we are currently working towards MBrace 1.0, and you may find some API differences between MBrace.Core and MBrace.StarterKit (https://github.com/mbraceproject/MBrace.StarterKit).

Best solution to host a (command line) Windows application?

I have a Windows application that does some calculations and is called from the command line. On my Windows machine, I have a PHP script running under Apache that executes the application and shows the output.
Is there any hosting solution that I can use to do the same? I can't figure out if EC2 or Azure are the right solutions. Basically, I need a web server plus the ability to execute my application.
Suggestions? Thanks.
You can host your application on AppHarbor, the .NET Platform-as-a-Service. You can either port your web frontend to .NET or try to get your PHP stuff working with Phalanger. AppHarbor is working on Background Tasks, which might be a good match for your workload.
I would just run the PHP script you already have under IIS in a Windows Azure web role.
If it is a Windows application and you have the source code, I would go with an Azure Worker Role. The advantage of using a PaaS (such as Azure) instead of an IaaS (such as Amazon) is that you won't have to bother with keeping the server up to date.
The real investment in time will be rewriting your application to make it work as a Worker Role. The time needed to do this work depends on how your application works right now. If it uses a lot of disk access it might be difficult, and perhaps an Amazon server would be better. But if it only crunches numbers in memory, an Azure Worker Role is a very good candidate.
The real advantage of using an Amazon server is that you probably won't need to do any work at all, except maintaining the server.
As described in the question both Azure and EC2 will do the job very well. This is the kind of task both systems are designed for.
So the question becomes really: which is best? That depends on two things: what the application needs to do and your own experience and preference.
As it's a Windows application there should probably be a leaning towards Azure. While EC2 supports Windows, the tooling and support resources for Azure are probably deeper at this point.
If cost is a factor then a (somewhat outdated) resource is here: http://blog.mccrory.me/2010/10/30/public-cloud-hourly-cost-comparison/ -- the conclusion is that, by and large, Azure and Amazon are roughly similar for compute charges.
Steve Marx has a blog post that describes how to run another web server (i.e. not IIS) on Azure.
This potentially has everything you need - you can deploy Apache and your executable and run it in exactly the same way.
Alternatively, you can deploy your executable alongside a bit of code in a worker role that would run the application periodically, all depending on your exact requirements.
