How to copy a local MLflow run to a remote tracking server?

I am currently tracking my MLflow runs to a local file path URI. I would also like to set up a remote tracking server to share with my collaborators. One thing I would like to avoid is to log everything to the server, as it might soon be flooded with failed runs.
Ideally, I'd like to keep my local tracker, and then be able to send only the promising runs to the server.
What is the recommended way of copying a run from a local tracker to a remote server?

To publish your trained model to a remote MLflow server, use the 'register_model' API. For example, if you are using the spaCy flavor of MLflow, you can do the following, where 'nlp' is the trained model:
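import mlflow

# Log the trained spaCy model ('nlp') as an artifact of the active run.
mlflow.spacy.log_model(spacy_model=nlp, artifact_path='mlflow_sample')
# Build the runs:/ URI that points at the model just logged.
model_uri = "runs:/{run_id}/{artifact_path}".format(
    run_id=mlflow.active_run().info.run_id, artifact_path='mlflow_sample'
)
# Register that model under a name on the tracking server.
mlflow.register_model(model_uri=model_uri, name='mlflow_sample')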
Make sure the following environment variables are set. The example below uses S3 storage:
SET MLFLOW_TRACKING_URI=https://YOUR-REMOTE-MLFLOW-HOST
SET MLFLOW_S3_BUCKET=s3://YOUR-BUCKET-NAME
SET AWS_ACCESS_KEY_ID=YOUR-ACCESS-KEY
SET AWS_SECRET_ACCESS_KEY=YOUR-SECRET-KEY

I have been interested in a related capability: copying runs from one experiment to another, for a similar reason, i.e. to keep one area for arbitrary runs and another into which the promising runs we move forward with are moved. Your scenario with a separate tracking server is just a generalization of mine. Either way, there apparently is no built-in MLflow feature for this at present. However, the Python-based mlflow-export-import tool looks like it may cover both our use cases; it cites usage on both Databricks and open-source MLflow, and it appears current as of this writing. I have not tried the tool myself yet. If and when I do, I'm happy to post a follow-up here saying whether it worked well for this purpose, and anyone else is welcome to do the same. Thanks and cheers!
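For anyone who wants to see roughly what such a copy involves, here is a minimal sketch rather than a tested recipe. It assumes both arguments are mlflow.tracking.MlflowClient instances (one pointed at the local store, one at the remote server) and copies only params and metric history. Artifacts, tags and nested runs are left out. The helper name copy_run and the URIs in the usage comment are placeholders of mine, not part of MLflow.

```python
def copy_run(src_client, dst_client, run_id, dst_experiment_id):
    """Copy params and full metric history of one run; return the new run ID.

    Both clients are expected to be mlflow.tracking.MlflowClient instances.
    Artifacts, tags and nested runs are NOT copied in this sketch.
    """
    src_run = src_client.get_run(run_id)
    dst_run = dst_client.create_run(dst_experiment_id)
    dst_id = dst_run.info.run_id

    for key, value in src_run.data.params.items():
        dst_client.log_param(dst_id, key, value)

    # run.data.metrics holds only the latest value per key, so fetch the
    # full history to keep every logged step.
    for key in src_run.data.metrics:
        for m in src_client.get_metric_history(run_id, key):
            dst_client.log_metric(dst_id, key, m.value,
                                  timestamp=m.timestamp, step=m.step)

    # Close the copied run with the same status as the source run.
    dst_client.set_terminated(dst_id, status=src_run.info.status)
    return dst_id


# Usage (placeholder URIs and IDs, requires mlflow to be installed):
#   from mlflow.tracking import MlflowClient
#   local = MlflowClient(tracking_uri="file:./mlruns")
#   remote = MlflowClient(tracking_uri="https://YOUR-REMOTE-MLFLOW-HOST")
#   copy_run(local, remote, "LOCAL-RUN-ID", "REMOTE-EXPERIMENT-ID")
```

Passing the clients in keeps the helper independent of where either store lives. For anything beyond params and metrics, the mlflow-export-import tool mentioned above is probably the better route.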

Related

Intermediate step(s) between manual prod and CI/CD for Node/Next on EC2

For about 18 months now I've been working in Node, and for the last 6 months I've been slowly migrating my existing WordPress websites to Next.js.
To date, I've been deploying to production manually. I log into my production server, check out the latest release from GitHub, build, and do a pm2 restart.
Even though the above workflow seems to be the most commonly documented around the internet, it's always felt a little wrong to me.
Recently, I found myself in a situation where I needed to customise some 3rd party code. So, my main code now has a line in package.json that says
{
  ...
  "dependencies": {
    ...
    "react-share": "file:../react-share/react-share-4.4.1.tgz",
    ...
  },
  ...
}
which implies that I'm going to check out my custom react-share, build it somewhere on the production server, change this line to point to wherever I put it, and then rebuild.
Also, I'm using Prisma, which means that every time I deploy, before I do a build, I need to do an npx prisma generate to create the client.
This now all seems really, really wrong.
I don't know how a "simple" CI/CD environment might look, but whatever it looks like, it feels like overkill. It's just me doing development, and my production environment is a single EC2 server sitting behind AWS CloudFront.
It seems to me that I should be doing something more/different than what I'm currently doing, in service to someday moving to a CI/CD model, if/when I have a whole team working on this, or sufficient users that I have multiple load-balanced servers and need production to be continually up.
In particular, it feels like I shouldn't be building on the production server.
Are there any intermediary step(s) I can/should be taking for faster/less-error-prone/less-down-time deployment to a single EC2 instance for Next/Node apps, between manually deploying as I am currently, and some sort of CI/CD setup? Or are my only choices to do what I'm doing now, or go research how to do CI/CD?

You're approaching the early stages of what is technically called DevOps, if you aren't there already, judging from your context. Calling what you're asking a broad topic is an understatement, and explaining everything here would amount to writing an article about it, at the very least.
However, I'll give you a brief overview of how to approach this.
I don't know how a "simple" CI/CD environment might look, but whatever it looks like, it feels like overkill.
Simplicity & complexity are relative terms. A system which is complicated for one might be simple for another. CI/CD doesn't define any laws that you need to follow in order to create a perfect deployment procedure, as everyone's deployment requirement is unique (at some point).
In bullet points, here is what you need to figure out before you start setting up CI/CD:
- The sequence of steps your deployment procedure needs in order to deploy your latest version. Since you've been deploying manually, you already know your steps. All you need to do is fine-tune each step so that it requires no manual intervention when the CI program executes it automatically.
- Choose a CI program, such as Travis CI or Circle CI, or, if you're using GitHub, its own GitHub Actions; you can read their documentation for more details. Your CI program will be responsible for executing your deployment steps, which you describe to it in whichever format it understands (mostly .yml).
- The CI program will execute your steps on your behalf when a condition you provide is met (for example, when code is pushed to the prod branch). It executes the commands on a machine such as your EC2 instance. With GitHub Actions specifically, a runner is responsible for running your commands on your machine, so the runner must be set up beforehand on the instance you intend to deploy to. More details on runners can be found in the relevant documentation.
- Since the runner actually executes the commands on your machine, make sure that all required commands and parameters, including the files and directories involved, are accessible to the runner program, at least from a permissions point of view. For example, running npx prisma generate requires that the npx command is available and executable on the system, and that the folders in which the command creates, reads, updates, or deletes files are accessible to the runner. The same applies to all other commands.
- Get your hands on bash scripting as well. If your steps contain dynamic information, such as the package.json dependency path you mentioned that needs to be updated, a custom bash script that makes the change automatically will help. There are, however, several other ways, depending on the specific nature of the dynamic changes.
The points above are a huge (by huge, I mean astronomically huge) oversimplification of how CI/CD pipelines are set up. But I hope you get the idea at least.
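As a possible intermediate step before any CI program is involved, the manual sequence can first be captured in one script that stops at the first failing step. The sketch below is only an illustration: the command list is my guess at the steps described in the question ("npm ci", "npm run build" and the pm2 process name "my-app" are assumptions), and deploy is a hypothetical helper.

```python
import subprocess

# Assumed commands; replace them with whatever you currently type by hand.
DEPLOY_STEPS = [
    ["git", "pull", "--ff-only"],
    ["npm", "ci"],
    ["npx", "prisma", "generate"],
    ["npm", "run", "build"],
    ["pm2", "restart", "my-app"],  # "my-app" is a placeholder process name
]


def deploy(steps, cwd=".", run=subprocess.run):
    """Run each step in order and stop at the first failure.

    Returns the list of steps that completed successfully.
    """
    done = []
    for cmd in steps:
        print("->", " ".join(cmd))
        result = run(cmd, cwd=cwd)
        if result.returncode != 0:
            print("step failed, aborting:", " ".join(cmd))
            break
        done.append(cmd)
    return done


# Usage (the path is a placeholder):
#   completed = deploy(DEPLOY_STEPS, cwd="/path/to/app")
#   success = len(completed) == len(DEPLOY_STEPS)
```

Once a script like this runs without manual intervention, the same list of commands is essentially what you would later hand to a CI program.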
In particular, it feels like I shouldn't be building on the production server.
Your feeling is legitimate. You should replicate your production environment (including its deployment procedures) as closely as possible in a separate development environment, so that all your experiments, development and testing happen apart from production. Only after successful evaluation in the development environment should you deploy to production. Steps like building will most likely be done in both environments, since building is something your program needs in order to run, irrespective of the environment it runs in. Your future team will appreciate this separation of environments.
if/when I have a whole team working on this, or sufficient users that I have multiple load-balanced servers and need production to be continually up.
Again, this small statement is in itself a whole domain of IT, known as System Design. Put simply, you or your team will create an architecture for your whole system that supports your business requirements and scales as your audience grows, which is more than a simple Stack Overflow Q&A can explain.
Therefore,
or go research how to do CI/CD?
is what I'd recommend, and after reading everything above I hope you also feel it is the right way forward.
Useful references to begin with (not endorsing any resources, you can search for relevant/better resources too)
GitHub Actions self-hosted runners
System Design - Getting started
Bash scripting
Development, Staging, Production

Use Google App Engine or Google Cloud Compute VM to Test Run My App?

I'm moving my Three.js app and its customized Node.js environment, which I've been running on my local machine, to Google Cloud. I want to test things out there, and hopefully soon get some early alpha testing going with other people.
I'm not sure which is the wiser way to go... to upload the repo I've been running locally as-is onto a VM which users would then access via the VM's external IP until I get a good name to call this app... or merge my local node.js environment with what's available via the Google App Engine and run it on GAE.
Issues I'm running into with the linux VM approach... I'm not sure how to do the equivalent on the VM of what I've been doing locally. In Windows Powershell I cd into the app directory and then enter node index.js. I'm assuming by this method of deployment that I can get the app running as soon as the browser hits the external IP. I should mention too that the app will allow users to save content as well as upload images, and eventually, 3D models as well as json datasets.
Issues I'm running into with the App Engine approach: it looks like I only have access to a linux-based command line, and have to install all the node.js modules manually. Meanwhile I have a bunch of files to upload, both the server-side node files and all the frontend stuff. I don't see where to upload those files, and ultimately what I'd like to do is have access to a visual, editable file-tree interface, as I have in Windows and FileZilla, so I can swap files in and out, etc. Alternatively I suppose I could import a repo from Github? Github would be fine as long as I can visually see what's happening. Is there a visual interface for file structure available in GAE somewhere? Am I missing something?
I went through the GAE "Hello World" tutorial and that worked fine, but was left scratching my head afterward regarding how to actually see and edit the guts of the tutorial app, or even where to look for the files.
So first off, I want to determine what's the better approach, and then if possible, determine how to make the experience of getting my app up there and running a more visual, user-friendly experience.
Thanks.

There are many things to consider when choosing how to run an app, but my instinct for your use case is to simply use a VM on GCE. The most compelling reason for this is that it's the most similar thing to what you have now. You can SSH into the machine and run nohup node index.js & (or node index.js inside tmux/screen if you prefer) and it will start the app and not stop it when you log out of SSH. You can use SCP / SFTP with whatever GUI client you want to upload files. You don't have to learn anything new! If you wanted to, you could even use a Windows VM (although I think you have to pay a little more than for a comparable Linux VM due to the licensing fees).
That said, the other way is arguably more "correct" by modern development standards, but it will involve a lot more learning that will prevent you from getting your app running somewhere other than your laptop in the short term:
First, you'll need to learn about Docker and stateless containers, which is basically what your app runs inside of on AppEngine.
Next, you'll need to learn how to hook up a separate stateful service (database, file server, ...) to your app's container so you can store your files, etc. in it, and then probably rewrite your app somewhat to use it to store stuff.
Next, you'll probably want some way to automatically deploy this from code instead of manually doing it, which gets you into build systems, package managers, artifact storage, continuous integration systems, and on and on and on.
This latter path is certainly what you should choose for a long-running production service if you work with a big team of developers -- but that doesn't mean that it's necessarily the right path for your project today. If you don't care about scaling up automatically, load balancing between nodes, redundant copies of your app running in different regions in case there's a natural disaster, etc., then go with the easy way for now, and you can learn new ways to improve the service when they're actually needed.

How do I pass in the google cloud project to the SHC BigTable connector at runtime?

I'm trying to access BigTable from Spark (Dataproc). I tried several different methods and SHC seems to be the cleanest for what I am trying to do and performs well.
https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
However, this approach requires that I put the Google Cloud project ID in hbase-site.xml, which means I need to build separate versions of the fat jar file with my Spark code for each env I run on (prod, staging, etc.), which is something I'd like to avoid.
Is there a way for me to pass in the google cloud project id at runtime?

As far as I can tell, the SHC library does not let you pass through HBase configs (looking in here).
The easiest thing would be to run an init action that gets the VM's project ID from VM metadata and sets it in hbase-site.xml. We are working on an initialization action that does that and installs the HBase client for Bigtable. Check out the in-progress pull request, which would be a good starting point if you needed to write one immediately. Otherwise, I expect the PR to get merged in the next couple of weeks.
Alternatively, consider adding an option in SHC for passing through properties to the HBaseConfiguration it creates. That would be a valuable feature for the broader community.
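Until that init action is available, the idea could be sketched as follows. This is a rough illustration, not the code from the pull request. It reads the project ID from the GCE metadata server and writes it into the XML. The property name google.bigtable.project.id is, to my knowledge, the key the Bigtable HBase client reads, but check it against the hbase-site.xml you ship. The file path in the usage comment is an assumption, and both function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/project/project-id"


def fetch_project_id(url=METADATA_URL):
    """Ask the GCE metadata server for the project ID of the current VM."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode("utf-8").strip()


def set_property(xml_text, name, value):
    """Return hbase-site.xml content with <property> `name` set to `value`."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No such property yet: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Usage on a Dataproc node (the path is an assumed location):
#   path = "/etc/hbase/conf/hbase-site.xml"
#   with open(path) as f:
#       updated = set_property(f.read(), "google.bigtable.project.id", fetch_project_id())
#   with open(path, "w") as f:
#       f.write(updated)
```

Note that ElementTree drops XML comments and processing instructions when it rewrites the file, which may or may not matter for your configuration.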

Running (& compiling) untrusted user code

I want to create an application with a feature that lets users submit code, which the server then compiles and runs, similar to Ideone and Spoj. How do I do this securely and in a scalable manner?
Partial Solutions I'm aware of:
IDEA 1 - 3rd Party Services
The Sphere Engine. However this costs a LOT of money!
I'm not aware of any open source application I can run on my server to achieve this, or a cheaper alternative. Please correct me if I'm wrong.
IDEA 2 - VM
This would be the next most sensible choice. However, I'm unsure how to implement it. For example, let's say I created a VM and started to run the user's code. This would restrict damage to MY system, but not damage to the VM, which other users would have to use. Does that mean I have to create a new VM each and every time I want to compile and run a user's code (which clearly is not scalable, correct me if I'm wrong)?

Not having set up such a thing myself, I assume that services like TravisCI (which compiles code and runs it against test cases you provide) have a base virtual machine image that boots up and processes your code. The next user to come along gets a separate VM booted from the same base image; your changes aren't stored.
So inside the VM, the user code can do whatever it likes. All of its effects, except what was written to the console, will be erased at the end of the time limit.
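The lifecycle described above (fresh environment per submission, a time limit, only console output survives) can be mimicked in a few lines. Please read this strictly as an illustration of that lifecycle: a temporary directory is NOT a security boundary, and in a real service the subprocess call would be replaced by booting a throwaway VM from the base image. The function name run_submission and the choice of Python submissions are assumptions made for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_submission(source_code, time_limit=5.0):
    """Run one submission in a throwaway directory and return its console output.

    WARNING: this only mirrors the create/run/discard lifecycle. It provides
    no isolation; untrusted code needs a disposable VM or similar sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "main.py"
        script.write_text(source_code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=time_limit,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": "", "stderr": ""}
        status = "ok" if proc.returncode == 0 else "error"
        return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
    # Leaving the with-block deletes the directory and anything written there.
```

Each call starts from an empty workspace, so files left behind by one submission are never visible to the next, which is the property the per-user VM gives you at a much stronger isolation level.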

Running IIS server with Coypu and SpecFlow

I have already spent a lot of time googling for a solution, but I'm stuck!
I have an MVC application and I'm trying to do "integration testing" for my Views using Coypu and SpecFlow. But I don't know how I should manage the IIS server for this. Is there a way to actually run the server (first start of tests) and making the server use a special "test" DB (for example an in-memory RavenDB) emptied after each scenario (and filled during the background).
Is there a better or simpler way to do this?

I'm fairly new to this too, so take the answers with a pinch of salt, but as no one else has answered...
Is there a way to actually run the server (first start of tests) ...
You could use IIS Express, which can be called via the command line. You can spin up your website before any tests run (which I believe you can do with the [BeforeTestRun] attribute in SpecFlow) with a call via System.Diagnostics.Process.
The actual command line would be something like:
iisexpress.exe /path:c:\iisexpress\<your-site-published-to-filepath> /port:<anyport> /clr:v2.0
... and making the server use a special "test" DB (for example an in-memory RavenDB) emptied after each scenario (and filled during the background).
Whether you can use a special test DB depends, I guess, on how your data access works. If you can swap in an in-memory DB fairly easily, then you could do that. My understanding, though, is that integration tests should be as close to the production environment as possible, so if possible use the same DBMS you're using in production.
What I'm doing is just doing a data restore to my test DB from a known backup of the prod DB, each time before the tests run. I can again call this via command-line/Process before my tests run. For my DB it's a fairly small dataset, and I can restore just the tables relevant to my tests, so this overhead isn't too prohibitive for integration tests. (It wouldn't be acceptable for unit tests however, which is where you would probably have mock repositories or in-memory data.)

Since you're already using SpecFlow, take a look at SpecRun (http://www.specrun.com/).
It's a test runner which is designed for SpecFlow tests and adds all sorts of capabilities, from small conveniences like better formatting of the Test names in the Test Explorer to support for running the same SpecFlow test against multiple targets and config file transformations.
With SpecRun you define a "Profile" which will be used to run your tests, not dissimilar to the VS .runsettings file. In there you can specify:
<DeploymentTransformation>
  <Steps>
    <IISExpress webAppFolder="..\..\MyProject.Web" port="5555"/>
  </Steps>
</DeploymentTransformation>
SpecRun will then start up an IISExpress instance running that Website before running your tests. In the same place you can also set up custom Deployment Transformations (using the standard App.Config transformations) to override the connection strings in your app's Web.config so that it points to the in-memory DB.
The only problem I've had with SpecRun is that the documentation isn't great: there are lots of video demonstrations, but I'd much rather have a few written tutorials. I guess that's what StackOverflow is here for.
